CONDITIONAL DENSITY ESTIMATION IN
A REGRESSION SETTING

By Sam Efromovich
The University of Texas at Dallas 

Regression problems are traditionally analyzed via univariate char- 
acteristics like the regression function, scale function and marginal 
density of regression errors. These characteristics are useful and in- 
formative whenever the association between the predictor and the 
response is relatively simple. More detailed information about the as- 
sociation can be provided by the conditional density of the response 
given the predictor. For the first time in the literature, this article 
develops the theory of minimax estimation of the conditional den- 
sity for regression settings with fixed and random designs of predic- 
tors, bounded and unbounded responses and a vast set of anisotropic 
classes of conditional densities. The study of fixed design regression is 
of special interest and novelty because the known literature is devoted 
to the case of random predictors. For the aforementioned models, the 
paper suggests a universal adaptive estimator which (i) matches per- 
formance of an oracle that knows both an underlying model and an 
estimated conditional density; (ii) is sharp minimax over a vast class 
of anisotropic conditional densities; (iii) is at least rate minimax when 
the response is independent of the predictor and thus a bivariate con- 
ditional density becomes a univariate density; (iv) is adaptive to an 
underlying design (fixed or random) of predictors. 

1. Introduction. Let $(Y_l, X_l)$, $l = 1, \ldots, n$, be independent pairs of
observations (bivariate data). We would like to analyze a relationship
(association) between variables $X_l$ (the predictor) and $Y_l$ (the response)
that allows one to quantify the input of $X_l$ on $Y_l$. To simplify the problem,
the nonparametric regression literature recommends analysis of the association
via the conditional expectation of the response given the predictor because this
implies estimation of a well-understood univariate function. In practical
applications, this simplification may or may not fully describe the association;
see discussion in [1, 5, 14, 16, 21, 31]. In general, the conditional density of 
the response given the predictor describes the ultimate association between 
the response and the predictor. However, this is a bivariate function and its 
estimation is complicated by the curse of dimensionality, which makes the
development of optimal estimators essential. The literature on such estimators
is very sparse, and the aim of this article is to develop minimax and
oracle theory of estimation of conditional densities.

Let us formulate the problem of estimation of the conditional density 
(in what follows, the abbreviation c.d. will often be used) considered in 
this paper. We would like to estimate the c.d. of the response Y given the 
predictor X in the following regression settings. First, we need to take into 
account two possible models of design of predictors. The first model is where 
pairs of observations are independent samples from a pair of two random 
variables $Y$ and $X$. Then, if the joint density $f(y, x)$ exists and the marginal
density $p(x) := \int_{-\infty}^{\infty} f(y, x)\,dy$ of the predictor is positive,
we are estimating the conditional density

(1.1)  $f(y|x) := \dfrac{f(y, x)}{p(x)}.$

It is traditional to refer to this design as random and to the marginal density 
p as the design density, regardless of the fact that it may be known or un- 
known to the statistician. The second model is where predictors are created 
by a deterministic procedure and then responses are generated according 
to a conditional density f(y\x). This is the case of a so-called fixed design. 
A discussion of these two designs can be found in [5, 31]; an interesting 
probabilistic point of view is presented in [1]. 

We also need to take into account that (i) the response can be either
bounded or unbounded (the former case is typical in practical applications
and the latter is of theoretical interest); (ii) the smoothness of the c.d. $f(y|x)$
may depend on the direction (it can be anisotropic), and moreover, if the
response and predictor are independent, then $f(y|x) = f(y)$; (iii) different
losses can be used to evaluate the quality of estimation of the c.d. All these
issues will be explored in this paper.

The level of known results on c.d. estimation is not on a par with the 
theory of multivariate density estimation. The latter is the reason why using 
(1.1) has been the main approach to assess the optimality of a c.d. estimator. 
To give an example of how this formula is used in the literature, let us note 
that an isotropic bivariate density with two derivatives for each component 
can be estimated with Mean Integrated Squared Error (MISE) of order
$n^{-2/3}$ and then, if the design density is sufficiently smooth (say it is twice
differentiable), this implies that the conditional density can also be evaluated
with MISE of order $n^{-2/3}$. While such an approach is legitimate, it has
obvious limitations. In particular, it cannot resolve many basic issues like 
how smoothness of the design density affects estimation of the conditional 
density or how to consider a classical fixed design regression. 

Formula (1.1) has also been an inspiration for creating ad hoc estimators
of the c.d. with the main theoretical emphasis on the bias-variance
analysis. The interested reader can find a historical overview of this and 
related approaches in the books [5, 15, 34]; other relevant references are 
[2, 16, 17, 21, 22, 23, 26, 27, 28, 35]. 

The content of the article is as follows. Section 2 presents the setting. 
Sections 3 and 4 describe new sharp minimax lower bounds under $L_2([0,1]^2)$
and $L_2((-\infty,\infty) \times [0,1])$ losses, respectively. A c.d. estimator is defined in
Section 5. An oracle inequality which shows how well the estimator matches 
an oracle that knows an estimated c.d. is presented in Section 6. Minimax 
properties of the estimator are established in Section 7. Optimal design of 
predictors for controlled experiments is explored in Section 8. Discussion 
of the results obtained, including analysis of real datasets, can be found in 
Section 9. Proofs are deferred to Section 10. 

The following notation will be used throughout the article: $i$ always
denotes the complex unit, that is, $i^2 = -1$; $\mathrm{Re}\{\cdot\}$ is the real part; $o(1)$'s are
generic sequences in $n$ such that $o(1) \to 0$ as $n \to \infty$; $Q$ is a positive
constant; $C$'s are generic positive constants; $(x)_+ := \max(0, x)$; $\lfloor x \rfloor$ is the
integer part of $x$; $I(\cdot)$ is the indicator; the cosine basis on $[0,1]$ is denoted by
$\varphi_0(x) := 1$, $\varphi_j(x) := 2^{1/2}\cos(\pi j x)$, $j = 1, 2, \ldots$. We shall use two different loss
functions to study the performance of an estimator $\hat f(y|x)$: $L_2([0,1]^2)$ loss,
which is $\int_0^1\int_0^1 (\hat f(y|x) - f(y|x))^2\,dy\,dx$, and $L_2((-\infty,\infty) \times [0,1])$ loss, which
is $\int_0^1 [\int_{-\infty}^{\infty} (\hat f(y|x) - f(y|x))^2\,dy]\,dx$. If these two loss functions are
considered simultaneously, then the area of integration is not written, with the
understanding that it corresponds to an underlying loss.

2. Considered model. Observations are $n$ pairs $\{(Y_l, X_l), l = 1, \ldots, n\}$
which are generated according to one of the following two designs. (i) Random
design. The pairs are independent samples from a pair $(Y, X)$ of two
random variables (the response and the predictor) with the joint density
$f(y,x)$. Set $p(x) := \int_{-\infty}^{\infty} f(y,x)\,dy$ for the marginal (design) density of the
predictor $X$. Assume that $p(x)$ is positive over its support. Then the problem
is to find a corresponding conditional density (c.d.) $f(y|x) := f(y,x)/p(x)$.
(ii) Fixed design. Let $X_1, \ldots, X_n$ be a deterministic sequence. Then a
corresponding sequence of independent random variables $Y_1, \ldots, Y_n$ is generated
according to a c.d. $f(y|x)$, that is, given $X_l = x$, the response $Y_l$ is distributed
according to the density $f(y|x)$ which should be estimated.

In what follows, it is always assumed that predictors take values from
the unit interval $[0,1]$, which is also the support of the random predictor
$X$. Further, in a fixed design case, it is assumed that $(X_1, \ldots, X_n)$
is a permutation of $(X_{(1)}, \ldots, X_{(n)})$ generated by the algorithm $X_{(0)} = 0$,
$\int_{X_{(l)}}^{X_{(l+1)}} p(x)\,dx = (n+1)^{-1}$, $l = 0, 1, \ldots, n$, $X_{(n+1)} = 1$, where $p(x)$ is a
positive probability density supported on $[0,1]$. Hereafter, $p(x)$ will be referred
to as the design density, regardless of an underlying design.
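
The fixed design algorithm above places the ordered predictors at the quantiles of levels $l/(n+1)$ of the design density. A minimal numerical sketch is given below; the function name `fixed_design`, the grid-based inversion of the distribution function and the example density $p(x) = 0.5 + x$ are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def fixed_design(n, p, grid_size=100_001):
    """Return X_(1) < ... < X_(n) such that neighbouring points (with X_(0) = 0
    and X_(n+1) = 1) enclose mass 1/(n+1) under the design density p."""
    x = np.linspace(0.0, 1.0, grid_size)
    px = p(x)
    # Cumulative distribution function via the trapezoidal rule.
    cdf = np.concatenate(([0.0], np.cumsum(0.5 * (px[1:] + px[:-1]) * np.diff(x))))
    cdf /= cdf[-1]  # guard against small quadrature error
    levels = np.arange(1, n + 1) / (n + 1)
    # Invert the CDF by linear interpolation.
    return np.interp(levels, cdf, x)

uniform_points = fixed_design(4, lambda x: np.ones_like(x))
tilted_points = fixed_design(3, lambda x: 0.5 + x)  # hypothetical p(x) = 0.5 + x
```

For the uniform design density the points are equidistant; for a nonuniform density they concentrate where $p(x)$ is larger.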

The considered statistical problem is to estimate the c.d. f(y\x) as a bi- 
variate function under the Mean Integrated Squared Error (MISE) criterion 
with the two types of loss functions defined in the last paragraph of the 
Introduction. Because an underlying design (fixed or random) is unknown 
to the statistician, a suggested estimator should be universal (not dependent 
on an underlying design). 

We are now in a position to discuss possible assumptions about the c.d.
and the design density. It is traditional in the c.d. estimation literature 
to consider the problem as a particular example of estimation of a bivariate 
density, and this explains a typical assumption that an estimated c.d. f(y\x) 
is isotropic, meaning that it is as smooth in y as in x. Let us recall that the 
most popular assumption is the twofold partial differentiability of f(y\x) 
with respect to y and x; see the literature mentioned in the Introduction. 
In general, such an assumption may be reasonable for the joint density of 
two abstract random variables, but in a regression setting, there are obvious 
differences between the predictor and the response. Only as an example, 
which makes this point crystal clear, let us consider an additive regression 
model $Y = m(X) + \varepsilon$, where $\varepsilon$ is an independent error with density $q(z)$.
Then f(y\x) = q(y — m(x)) and it is easy to realize that the smoothness of the 
c.d. in y is dependent solely on the smoothness of q(z), while the smoothness 
of the c.d. in x depends on the smoothness of q(z) and the smoothness of 
the underlying regression function m(x). Thus, it is prudent to assume that 
the c.d. may be an anisotropic bivariate function whose smoothness depends 
on the direction; corresponding classes of such functions will be introduced 
in Sections 3 and 4. 

3. Sharp local-minimax lower bound for $L_2([0,1]^2)$ loss. The main aim
of this section is to understand how an estimated c.d. together with an 
underlying design density affect the MISE. To explain the employed local- 
minimax approach (which originated in [18]), let us recall, following that 
article, a classical lower local-minimax bound for estimation of a univariate 
density f(y) over the unit interval [0,1]. It is assumed that the density 
is close to a given pivotal density fo(y). Suppose that the pivotal density 
is continuous and bounded below from zero on the interval [0, 1] and no 
assumption about $f_0(y)$ for $y$ beyond the unit interval is made. Introduce
a class of densities $S(m,Q,f_0,\rho) := \{f(y): \int_{-\infty}^{\infty} f(y)\,dy = 1,\ f(y) \ge 0,\ f(y) =
f_0(y) + g(y),\ y \in [0,1],\ g \in S(m,Q),\ \sup_{y \in [0,1]} |g(y)| < \rho\}$, where $S(m,Q)$ is a
Sobolev class of functions $g(y)$ that are $m$-times differentiable on $[0,1]$ and
$S(m,Q) := \{g(y): g(y) = \sum_{j=1}^{\infty} \theta_j \varphi_j(y),\ y \in [0,1],\ \sum_{j=1}^{\infty} (\pi j)^{2m} \theta_j^2 \le Q\}$.

Following [5, 7, 12], consider estimation of a univariate density $f(y)$ based
on a sample $Y_1, \ldots, Y_n$ generated according to this density, set $a_j := (\pi j)^{2m}$,
define $d_Y := d_Y(f) := \int_0^1 f(y)\,dy$ as the coefficient of difficulty, introduce a
positive sequence $\mu_{Yn}$ such that $d_Y n^{-1} \sum_{j=0}^{\infty} ([a_j/\mu_{Yn}]^{1/2} - a_j)_+ := Q$, and
then set

$M_n^* := d_Y n^{-1} \sum_{j=0}^{\infty} (1 - [a_j \mu_{Yn}]^{1/2})_+ .$

Pinsker [33] evaluated $M_n^*$ and showed that $M_n^* = M_n(f)(1 + o(1))$, where

(3.1)  $M_n(f) = [P(m) Q^{1/(2m+1)}][d_Y(f)/n]^{2m/(2m+1)}$

and

(3.2)  $P(m) = (2m+1)^{1/(2m+1)} [m/(\pi(m+1))]^{2m/(2m+1)}.$

Note that $P(m)$ and/or $P(m)Q^{1/(2m+1)}$ may be referred to as the Pinsker
constant. Then, it is established in [18] that for a slowly vanishing positive
sequence $\rho_n$, the following local-minimax lower bound holds:

(3.3)  $\inf_{\hat f} \sup_{f \in S(m,Q,f_0,\rho_n)} M_n^{-1}(f)\, E_f\{\int_0^1 (\hat f(x) - f(x))^2\,dx\} \ge 1 + o(1),$

where the infimum is taken over all possible estimators $\hat f$ based on $n$
realizations $Y_1, \ldots, Y_n$, the pivotal density $f_0(y)$ and parameters $m$, $Q$ and $\rho_n$.
Moreover, this lower bound is sharp because it is attained by data-driven
estimators; see [3, 5, 9].
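
The relation $M_n^* = M_n(f)(1 + o(1))$ can be checked numerically. The sketch below solves the constraint defining $\mu_{Yn}$ by bisection and compares the resulting series with (3.1) and (3.2). The function names, the truncation point `jmax`, the bisection bounds and the parameter values ($m = 1$, $Q = 1$, $d_Y = 1$, $n = 10^{10}$) are illustrative assumptions.

```python
import numpy as np

def pinsker_constant(m):
    """P(m) of (3.2)."""
    return (2 * m + 1) ** (1 / (2 * m + 1)) * (m / (np.pi * (m + 1))) ** (2 * m / (2 * m + 1))

def asymptotic_risk(n, m, Q, d=1.0):
    """M_n(f) of (3.1) with coefficient of difficulty d."""
    return pinsker_constant(m) * Q ** (1 / (2 * m + 1)) * (d / n) ** (2 * m / (2 * m + 1))

def exact_risk(n, m, Q, d=1.0, jmax=200_000):
    """M_n^*: solve the constraint for mu_{Yn} by bisection, then sum the risk series.
    The series is truncated at jmax, which must exceed the effective cutoff."""
    a = (np.pi * np.arange(jmax)) ** (2 * m)

    def constraint(mu):
        return d / n * np.sum(np.clip(np.sqrt(a / mu) - a, 0.0, None))

    lo, hi = 1e-30, 1.0  # constraint(lo) > Q > constraint(hi) for the values used here
    for _ in range(200):
        mid = np.sqrt(lo * hi)  # bisection on the log scale
        if constraint(mid) > Q:
            lo = mid
        else:
            hi = mid
    mu = np.sqrt(lo * hi)
    return d / n * np.sum(np.clip(1.0 - np.sqrt(a * mu), 0.0, None))

ratio = exact_risk(n=1e10, m=1, Q=1.0) / asymptotic_risk(n=1e10, m=1, Q=1.0)
```

The ratio approaches one only slowly in $n$, which is why a very large sample size is used in this illustration.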

This is the approach that we would like to take for the c.d. problem, and
this is the result to match. To this end, we introduce a similar setting for
conditional density estimation. Let $m_X$ and $m_Y$ be positive integers. Consider
a bivariate function $g(y,x)$, $(y,x) \in [0,1]^2$, which is $m_Y$-times differentiable
with respect to $y$ and $m_X$-times differentiable with respect to $x$ (here and
in what follows, partial differentiation is meant) and which belongs to a
corresponding anisotropic Sobolev class,

(3.4)  $S(m_Y, m_X, Q) := \Big\{ g(y,x): g(y,x) = \sum_{j,r=0}^{\infty} \theta_{jr} \varphi_j(y) \varphi_r(x),\ (y,x) \in [0,1]^2,$
       $\sum_{j,r=0}^{\infty} [(\pi j)^{2m_Y} + (\pi r)^{2m_X}]\, \theta_{jr}^2 \le Q \Big\}.$




This Sobolev class is well known in the statistical literature; see the
discussion in [25]. Let $f_0(y|x)$, $(y,x) \in (-\infty,\infty) \times [0,1]$, be a pivotal conditional
density which is continuous and bounded below from zero on $[0,1]^2$, no
assumption about $f_0(y|x)$ for $(y,x)$ beyond the unit square being made.
Introduce a class of conditional densities $S(m_Y, m_X, Q, f_0(y|x), \rho) := \{f(y|x):
\int_{-\infty}^{\infty} f(y|x)\,dy = 1,\ f(y|x) \ge 0,\ (y,x) \in (-\infty,\infty) \times [0,1];\ f(y|x) = f_0(y|x) +
g(y,x),\ (y,x) \in [0,1]^2;\ g(y,x) \in S(m_Y, m_X, Q);\ \sup_{(y,x) \in [0,1]^2} |g(y,x)| < \rho\}$.
Also, let $p(x)$, $\int_0^1 p(x)\,dx = 1$, be the design density which is continuous and
bounded below from zero on $[0,1]$. Then, similarly to the univariate density
setting, the problem is to explore a local-minimax estimation over this c.d.
class. Set $a_{jr} := (\pi j)^{2m_Y} + (\pi r)^{2m_X}$, introduce the coefficient of difficulty of
estimation of the c.d. over the unit square,

(3.5)  $d := d(f,p) := \int_{[0,1]^2} f(y|x)\, p^{-1}(x)\,dy\,dx,$

define a positive $\eta_n$ such that

(3.6)  $d n^{-1} \sum_{j,r=0}^{\infty} ([a_{jr}/\eta_n]^{1/2} - a_{jr})_+ = Q$

and then set

(3.7)  $R_n^*(S) := d n^{-1} \sum_{j,r=0}^{\infty} (1 - [a_{jr} \eta_n]^{1/2})_+ .$

It will be shown in Section 10 that $R_n^*(S) = R_n(f,p,S)(1 + o(1))$, where

(3.8)  $R_n(f,p,S) = [P(\alpha,\beta) Q^{1/(2\tau+1)}][d(f,p) n^{-1}]^{2\tau/(2\tau+1)},$

$\alpha = m_Y$, $\beta = m_X$, $1/(2\tau) := 1/(2\alpha) + 1/(2\beta)$, the new Pinsker constant for
estimation of the c.d. on $[0,1]^2$ is

(3.9)  $P(\alpha,\beta) := \pi^{-4\tau/(2\tau+1)} [J_1(\alpha,\beta)]^{-1/(2\tau+1)} J_2(\alpha,\beta)$

and

(3.10)  $J_1(\alpha,\beta) := \int_{\{(u,v):\, u^{2\alpha} + v^{2\beta} < 1;\ u,v > 0\}} ([u^{2\alpha} + v^{2\beta}]^{1/2} - [u^{2\alpha} + v^{2\beta}])\,dv\,du,$

(3.11)  $J_2(\alpha,\beta) := \int_{\{(u,v):\, u^{2\alpha} + v^{2\beta} < 1;\ u,v > 0\}} (1 - [u^{2\alpha} + v^{2\beta}]^{1/2})\,dv\,du.$
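
The integrals (3.10) and (3.11) have no simple closed form in general but are easy to approximate. The following sketch uses a midpoint rule on the unit square (which contains the integration domain); the function name `pinsker_cd` and the grid size are illustrative assumptions. In the isotropic case $\alpha = \beta = 1$ the domain is a quarter of the unit disk and polar coordinates give $J_1 = \pi/24$ and $J_2 = \pi/12$, which serves as a check.

```python
import numpy as np

def pinsker_cd(alpha, beta, grid=1000):
    """Midpoint-rule approximation of J_1, J_2 of (3.10)-(3.11) and of P(alpha, beta) in (3.9)."""
    tau = 1.0 / (1.0 / alpha + 1.0 / beta)  # from 1/(2 tau) = 1/(2 alpha) + 1/(2 beta)
    h = 1.0 / grid
    pts = (np.arange(grid) + 0.5) * h
    u, v = np.meshgrid(pts, pts, indexing="ij")
    s = u ** (2 * alpha) + v ** (2 * beta)
    inside = s < 1.0                        # integration domain
    root = np.sqrt(s)
    j1 = np.sum((root - s)[inside]) * h * h
    j2 = np.sum((1.0 - root)[inside]) * h * h
    p = np.pi ** (-4 * tau / (2 * tau + 1)) * j1 ** (-1 / (2 * tau + 1)) * j2
    return j1, j2, p

j1_iso, j2_iso, p_iso = pinsker_cd(1, 1)    # isotropic check case
j1_an, j2_an, p_an = pinsker_cd(2, 1)       # an anisotropic example
```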

We can now present a local-minimax lower bound for c.d. estimation. In
what follows, $E_{f(y|x),p(x)}\{\cdot\}$ denotes the expectation given the c.d. $f(y|x)$
and the design density $p(x)$; note that this expectation is well defined for
both random and fixed design settings and we may omit the subscript
whenever no confusion arises.

Theorem 3.1. Random and fixed designs are considered simultaneously.
Consider $L_2([0,1]^2)$ loss. Suppose that a known design density $p(x)$ is
continuous and bounded below from zero on its support $[0,1]$. Then, for a slowly
vanishing positive sequence $\rho_n$, the local-minimax MISE of estimation of a
conditional density $f(y|x)$ satisfies the lower bound

(3.12)  $\inf_{\hat f} \sup_{f(y|x) \in S(m_Y, m_X, Q, f_0, \rho_n)} E_{f(y|x),p(x)} \Big\{ R_n^{-1}(f,p,S) \int_{[0,1]^2} (\hat f(y|x) - f(y|x))^2\,dy\,dx \Big\} \ge 1 + o(1),$

where $R_n(f,p,S)$ is defined in (3.8) and the infimum is taken over all possible
c.d. estimators $\hat f(y|x)$ based on $n$ independent pairs $(Y_1, X_1), \ldots, (Y_n, X_n)$
of observations, generated according to $(f(y|x), p(x))$, as well as on the
pivotal conditional density $f_0(y|x)$, the design density $p(x)$ and the parameters
$m_Y$, $m_X$, $Q$ and $\rho_n$.

Sobolev function classes are classical in the regression literature. The
density estimation literature also considers smoother function classes such as
analytic ones; see discussion in [5, 20, 24, 28, 36]. Thus we shall complement
the class of differentiable c.d.s considered above by two classical classes of
smoother functions. Let $\gamma$, $\gamma_1$ and $\gamma_2$ be positive real numbers and recall that
$Q$ is a positive real number. We begin by introducing an analytic-Sobolev
class of bivariate functions,

(3.13)  $AS(\gamma, m_X, Q) := \Big\{ g(y,x): g(y,x) = \sum_{j,r=0}^{\infty} \theta_{jr} \varphi_j(y) \varphi_r(x),\ (y,x) \in [0,1]^2,$
        $\theta_{jr} = \int_{[0,1]^2} g(y,x) \varphi_j(y) \varphi_r(x)\,dy\,dx,$
        $\sum_{j,r=0}^{\infty} [(e^{\gamma \pi j} + (\pi r)^{2m_X}) I(j + r > 0)]\, \theta_{jr}^2 \le Q \Big\}.$
This class includes bivariate functions $g(y,x)$ which are analytic in $y$ and
$m_X$-fold differentiable in $x$. It is also possible that the conditional density is
analytic in both $y$ and $x$. Let us then define an (anisotropic) analytic class

(3.14)  $A(\gamma_1, \gamma_2, Q) := \Big\{ g(y,x): g(y,x) = \sum_{j,r=0}^{\infty} \theta_{jr} \varphi_j(y) \varphi_r(x),\ (y,x) \in [0,1]^2,$
        $\theta_{jr} = \int_{[0,1]^2} g(y,x) \varphi_j(y) \varphi_r(x)\,dy\,dx,$
        $\sum_{j,r=0}^{\infty} [(e^{\gamma_1 \pi j} + e^{\gamma_2 r}) I(j + r > 0)]\, \theta_{jr}^2 \le Q \Big\}.$

Then, analogously to the local class $S(m_Y, m_X, Q, f_0(y|x), \rho_n)$ defined
above, we can introduce local classes $AS(\gamma, m_X, Q, f_0(y|x), \rho_n)$ and
$A(\gamma_1, \gamma_2, Q, f_0(y|x), \rho_n)$. For these two classes relations (3.6)-(3.7), with
corresponding $a_{jr} = (e^{\gamma \pi j} + (\pi r)^{2m_X}) I(j + r > 0)$ and $a_{jr} = (e^{\gamma_1 \pi j} + e^{\gamma_2 r}) I(j + r > 0)$,
imply

(3.15)  $R_n(f,p,AS) = P(m_X) Q^{1/(2m_X+1)} (d(f,p)/n)^{2m_X/(2m_X+1)}$
        $\times\, [2 m_X \ln(n)/((2m_X + 1)\pi\gamma)]^{2m_X/(2m_X+1)},$

where $P(m_X)$ is defined in (3.2), and

(3.16)  $R_n(f,p,A) = (\pi \gamma_1 \gamma_2)^{-1} d(f,p)\, n^{-1} \ln^2(n).$

To shorten the presentation of lower bounds, in the following proposition, 
we will consider these two local classes of conditional densities together. 

Theorem 3.2. Random and fixed designs are considered simultaneously.
Consider $L_2([0,1]^2)$ loss. Suppose that a known design density $p(x)$ is
continuous and bounded below from zero on its support $[0,1]$. Then, for a slowly
vanishing positive sequence $\rho_n$ and $\mathcal{F}$ being either $AS(\gamma, m_X, Q, f_0, \rho_n)$ or
$A(\gamma_1, \gamma_2, Q, f_0, \rho_n)$, the local-minimax MISE of estimation of a conditional
density $f(y|x)$ satisfies the lower bound

(3.17)  $\inf_{\hat f} \sup_{f(y|x) \in \mathcal{F}} E_{f(y|x),p(x)} \Big\{ R_n^{-1}(f,p,\mathcal{F}) \int_{[0,1]^2} (\hat f(y|x) - f(y|x))^2\,dy\,dx \Big\} \ge 1 + o(1),$

where $R_n(f,p,\mathcal{F})$ is defined in (3.15) or (3.16) depending on the considered
class $\mathcal{F}$ and the infimum is taken over all possible c.d. estimators $\hat f(y|x)$
based on $n$ independent pairs $(Y_1, X_1), \ldots, (Y_n, X_n)$ of observations
generated according to $(f(y|x), p(x))$, as well as on the pivotal conditional density
$f_0(y|x)$, the design density $p(x)$ and all parameters defining the class $\mathcal{F}$.

A plain analysis of the coefficient of difficulty $d(f,p)$ defined in (3.5)
indicates that if $\int_0^1 f(y|x)\,dy = 1$, then the coefficient of difficulty does not
depend on the underlying c.d. $f(y|x)$. The latter is the case if $[0,1]^2$ is the
support of the joint density $f(y|x)p(x)$. Let us stress that the main reason
why we are considering a local-minimax approach is to explore how an
underlying c.d. affects the coefficient of difficulty. Using a similar proof, it is
straightforward to establish that if $L_2((-\infty,\infty) \times [0,1])$ loss is considered, then
the coefficient of difficulty does not depend on an underlying c.d. since, in this
case, we always have $\int_{-\infty}^{\infty} f(y|x)\,dy = 1$. Due to this remark, we can omit the analysis
of local-minimax MISEs for $L_2((-\infty,\infty) \times [0,1])$ loss and consider a classical
minimax approach in the next section.
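
The dependence of the coefficient of difficulty (3.5) on the c.d. can be illustrated numerically. In the sketch below, the function name `difficulty`, the design density $p(x) = 0.5 + x$ and the two conditional densities are hypothetical choices made only for illustration. When the response is supported on $[0,1]$, the coefficient reduces to $\int_0^1 p^{-1}(x)\,dx$ (here $\ln 3$); when part of the mass of the response lies outside $[0,1]$, it is smaller.

```python
import numpy as np

def difficulty(cond_density, design_density, grid=400):
    """Midpoint-rule approximation of d(f, p) in (3.5) over the unit square."""
    pts = (np.arange(grid) + 0.5) / grid
    y, x = np.meshgrid(pts, pts, indexing="ij")
    return np.mean(cond_density(y, x) / design_density(x))

p = lambda x: 0.5 + x                       # hypothetical design density on [0, 1]
f_inside = lambda y, x: np.ones_like(y)     # c.d. supported on [0, 1]
f_leaky = lambda y, x: np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2 * np.pi)  # N(x, 1) c.d.

d_inside = difficulty(f_inside, p)  # approximates the integral of 1/p over [0, 1]
d_leaky = difficulty(f_leaky, p)    # only the mass of Y inside [0, 1] contributes
```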

4. Sharp minimax lower bounds for $L_2((-\infty,\infty) \times [0,1])$ loss. The
aim of this section is to find minimax lower bounds for several anisotropic
classes of conditional densities and $L_2((-\infty,\infty) \times [0,1])$ loss. The latter is of
special interest in the case of unbounded responses. Because a loss function
is given, no ambiguity occurs if we use identical notation for function classes
defined on $[0,1]^2$ (as in the previous section) and those defined on
$(-\infty,\infty) \times [0,1]$ (considered in this section). The motivation for this abuse of
notation is that corresponding spaces are defined in such a manner that they
imply the same MISE convergence for both of the considered losses and this will
allow us to shorten the presentation of results.

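We begin with a Sobolev anisotropic class of conditional densities,

(4.1)  $S(m_Y, m_X, Q) := \Big\{ f(y|x): f(y|x) = \sum_{r=0}^{\infty} (2\pi)^{-1} \int_{-\infty}^{\infty} h_r(u) e^{-iuy}\,du\, \varphi_r(x),$
       $f(y|x) \ge 0,\ \int_{-\infty}^{\infty} f(y|x)\,dy = 1,\ (y,x) \in (-\infty,\infty) \times [0,1],$
       $\sum_{r=0}^{\infty} \pi^{-1} \int_0^{\infty} [u^{2m_Y} + (\pi r)^{2m_X}]\, |h_r(u)|^2\,du \le Q \Big\}.$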



To shed light on the functions $h_r(u)$, it may be helpful to note that if
$h(u|x) := \int_{-\infty}^{\infty} f(y|x) e^{iyu}\,dy$ denotes the conditional characteristic function,
then

(4.2)  $h_r(u) := \int_0^1 h(u|x) \varphi_r(x)\,dx$

is its $r$th Fourier coefficient and one can write $h(u|x) = \sum_{r=0}^{\infty} h_r(u) \varphi_r(x)$.
The Sobolev class (4.1) contains bivariate functions $g(y,x)$, $(y,x) \in (-\infty,\infty) \times
[0,1]$, having the square integrable generalized $m_Y$-fold partial derivative
with respect to $y$ and the square integrable generalized $m_X$-fold partial
derivative with respect to $x$; see the discussion in [32, 36].

Another anisotropic class to consider is an analytic-Sobolev one,

(4.3)  $AS(\gamma, m_X, Q) := \Big\{ f(y|x): f(y|x) = \sum_{r=0}^{\infty} (2\pi)^{-1} \int_{-\infty}^{\infty} h_r(u) e^{-iuy}\,du\, \varphi_r(x),$
       $f(y|x) \ge 0,\ \int_{-\infty}^{\infty} f(y|x)\,dy = 1,\ (y,x) \in (-\infty,\infty) \times [0,1],$
       $\sum_{r=0}^{\infty} \pi^{-1} \int_0^{\infty} [e^{\gamma u} + (\pi r)^{2m_X}]\, |h_r(u)|^2\,du \le Q \Big\},$

which is an analog of (3.13). Note that this class includes, among others, 
classical normal, Student and Cauchy conditional densities, as well as their 
mixtures and one-to-one transformations which are typical in additive re- 
gression; see the discussion in [20, 36]. 

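Finally, similarly to (3.14), we define an anisotropic analytic class

(4.4)  $A(\gamma_1, \gamma_2, Q) := \Big\{ f(y|x): f(y|x) = \sum_{r=0}^{\infty} (2\pi)^{-1} \int_{-\infty}^{\infty} h_r(u) e^{-iuy}\,du\, \varphi_r(x),$
       $f(y|x) \ge 0,\ \int_{-\infty}^{\infty} f(y|x)\,dy = 1,\ (y,x) \in (-\infty,\infty) \times [0,1],$
       $\sum_{r=0}^{\infty} \pi^{-1} \int_0^{\infty} [e^{\gamma_1 u} + e^{\gamma_2 r}]\, |h_r(u)|^2\,du \le Q \Big\}.$
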

We can now present lower minimax bounds. 

Theorem 4.1. Random and fixed designs are considered simultaneously.
Consider the case of $L_2((-\infty,\infty) \times [0,1])$ loss. Suppose that a known design
density $p(x)$ is continuous and bounded below from zero on its support $[0,1]$.
Then

(4.5)  $\inf_{\hat f} \sup_{f \in \mathcal{F}} E_{f(y|x),p(x)} \Big\{ \int_0^1 \int_{-\infty}^{\infty} (\hat f(y|x) - f(y|x))^2\,dy\,dx \Big\} \ge R_n(p,\mathcal{F})(1 + o(1)),$

where the infimum is taken over all possible estimators $\hat f$ based on the design
density $p(x)$, the class $\mathcal{F}$ and $n$ independent pairs of observations $(Y_l, X_l)$,
$l = 1, \ldots, n$, generated according to $(f(y|x), p(x))$. The asymptotic minimax
risk $R_n(p,\mathcal{F})$ is defined as follows. For the considered loss, the coefficient of
difficulty is simplified to $d := d(p) = \int_0^1 p^{-1}(x)\,dx$, and then,

(a) for an anisotropic Sobolev class $\mathcal{F} = S(m_Y, m_X, Q)$, the risk $R_n(p,\mathcal{F})$
is equal to the right-hand side of (3.8), with $\alpha = m_Y$, $\beta = m_X$ and $d(f,p)$
replaced by $d(p)$;

(b) for an analytic-Sobolev class $\mathcal{F} = AS(\gamma, m_X, Q)$, the risk $R_n(p,\mathcal{F})$
is equal to the right-hand side of (3.15), with $d(f,p)$ replaced by $d(p)$;

(c) for an anisotropic analytic class $\mathcal{F} = A(\gamma_1, \gamma_2, Q)$, the risk $R_n(p,\mathcal{F})$
is equal to the right-hand side of (3.16), with $d(f,p)$ replaced by $d(p)$.

We have obtained lower bounds for the MISE which allow us to introduce
the notion of sharp minimax estimation of the c.d. over a class $\mathcal{F}$ of c.d.s
whenever the MISE of an estimator attains a corresponding lower bound.

5. EP conditional density estimator. The objective of this section is to
suggest a conditional density estimator which is (i) adaptive to (in general
unknown) design of predictors; (ii) simultaneously sharp minimax over
the aforementioned anisotropic classes of conditional densities; (iii) at least
univariate-rate minimax when the c.d. is a univariate density, that is, under
the classical null hypothesis "the response is independent of the predictor."

The last aim makes it reasonable to rewrite a c.d. as a sum of a univariate
component and a bivariate component,

(5.1)  $f(y|x) = f(y) + \psi(y,x),$

where $\psi(y,x)$ vanishes if the response is independent of the predictor. A pair
of these components is defined differently for the two studied loss functions
and definitions will be presented shortly. (Let us stress that a loss function is
known to the statistician, so an estimator may be chosen accordingly.) Then
a blockwise-shrinkage Efromovich-Pinsker (EP) estimator will be developed
for the estimation of $f(y)$ and $\psi(y,x)$.

Remark 5.1. The interested reader can find a comprehensive discussion
of the EP estimation procedure in [5]. Here we briefly recall its main
idea. Suppose that a bivariate function $g(u,v)$ is estimated on a set $A$
and suppose that there exists an orthonormal basis $\{\psi_{js}(u,v);\ j,s \ge 0\}$ on
$A$ such that $g(u,v) = \sum_{j,s=0}^{\infty} \kappa_{js} \psi_{js}(u,v)$, $\kappa_{js} = \int_A g(u,v) \psi_{js}(u,v)\,du\,dv$ for
$(u,v) \in A$. Then a blockwise-shrinkage EP estimator is defined as follows.

All indices $(j,s)$ are divided into nonoverlapping blocks $B_k$, $k = 1, 2, \ldots$.
Also, a sequence of positive thresholds $t_k$ and a cutoff $K$ are chosen. Blocks,
thresholds and the cutoff may depend on the sample size $n$. An EP estimator
can then be written as

$\hat g(u,v) := \sum_{k=1}^{K} \hat\mu_k \sum_{(j,s) \in B_k} \hat\kappa_{js} \psi_{js}(u,v), \quad (u,v) \in A,$

where $\hat\kappa_{js}$ is an estimator of $\kappa_{js}$, with a method of moments estimator being
a typical choice, and $\hat\mu_k = \mu(B_k, t_k, n, \{\hat\kappa_{js}, (j,s) \in B_k\})$ is a shrinkage
coefficient. Let us stress two facts about this estimator. First, neither blocks
$B_k$, nor thresholds $t_k$, nor the cutoff $K$, nor the shrinkage-coefficient function
$\mu$ depends on observations; instead, they are chosen a priori. Second,
adaptation to the smoothness of an underlying function $g$ is achieved solely
by shrinkage coefficients via their dependence on observations. The
underlying idea of using a shrinkage procedure is to mimic Wiener's optimal
shrinkage coefficient (oracle) $\mu_k^* := \sum_{(j,s) \in B_k} \kappa_{js}^2 / [\sum_{(j,s) \in B_k} (\kappa_{js}^2 + \mathrm{Var}(\hat\kappa_{js}))]$.
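
To see why Wiener's coefficient is the natural benchmark, consider one block and assume (for illustration only) that each $\hat\kappa_{js}$ is unbiased. The risk of the shrunk block is then $\sum[(1-\mu)^2 \kappa_{js}^2 + \mu^2 \mathrm{Var}(\hat\kappa_{js})]$, a quadratic in $\mu$ minimized at $\mu_k^*$. The sketch below checks this numerically; the function names and the numerical values of the coefficients and variances are hypothetical.

```python
import numpy as np

def wiener_coefficient(kappa, var):
    """Oracle shrinkage for one block: sum(kappa^2) / sum(kappa^2 + Var)."""
    signal = np.sum(kappa ** 2)
    return signal / (signal + np.sum(var))

def block_risk(mu, kappa, var):
    """Expected squared error of the shrunk block mu * kappa_hat,
    assuming each kappa_hat is unbiased with the given variance."""
    return np.sum((1.0 - mu) ** 2 * kappa ** 2 + mu ** 2 * var)

kappa = np.array([0.30, 0.10, 0.05])   # hypothetical true coefficients in a block
var = np.full(3, 0.01)                 # hypothetical variances of their estimators
mu_star = wiener_coefficient(kappa, var)
risk_star = block_risk(mu_star, kappa, var)
```

Keeping the block unchanged ($\mu = 1$) or discarding it ($\mu = 0$) never gives a smaller risk than shrinking by `mu_star`.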

Let us present assumptions and notation. 

Assumption 1. An estimated conditional density $f(y|x)$ belongs to a
Sobolev class $S(1, 1, C)$, $C < \infty$, defined either in (3.4) or (4.1) for $L_2([0,1]^2)$
loss or $L_2((-\infty,\infty) \times [0,1])$ loss, respectively.

Remark 5.2. It is convenient to define the Sobolev class (3.4) as a class 
of bivariate functions and the Sobolev class (4.1) as a class of conditional 
densities. Nonetheless, because f(y\x) is the c.d., this difference plays no 
role in Assumption 1. 

Assumption 2. The design density $p(x)$, $x \in [0,1]$, is bounded below
from zero on its support $[0,1]$ and its first derivative $p^{(1)}(x)$ exists and is
bounded on $[0,1]$.

Notation. Whenever no ambiguity may arise, sets of integration are
omitted and, for instance, double integrals are taken over $[0,1]^2$ or
$(-\infty,\infty) \times [0,1]$, depending on the loss under consideration. $\mathcal{N}$ denotes the set of
nonnegative integers. Two given arrays of nonnegative numbers $\{0 = b_1 < b_2 <
\cdots\}$ and $\{b'_1 = 0,\ b'_2 = 1 + \lfloor \ln^{3/4}(n) \rfloor,\ b'_{s+1} = b'_s + \lfloor b'_2 (1 + 1/\ln\ln(n))^{s-2} \rfloor,\ s =
2, 3, \ldots\}$, will be used to define blocks and two given arrays of positive
numbers, $\{t_1, t_2, \ldots\}$ and $\{t_{k\tau} := 1/\ln\ln((k+3)(\tau+3)),\ k, \tau = 1, 2, \ldots\}$, will
denote thresholds. Two different arrays of blocks are used for estimation of
univariate $f(y)$ and bivariate $\psi(y,x)$ components of $f(y|x)$; recall (5.1).
The former is $\{B_k, k = 1, 2, \ldots\}$, where the blocks are either consecutive
sets of nonnegative integers $B_k := \{j: b_k \le j < b_{k+1},\ j \in \mathcal{N}\}$ or intervals
$B_k := [b_k, b_{k+1})$ for $L_2([0,1]^2)$ or $L_2((-\infty,\infty) \times [0,1])$ loss, respectively. The
latter blocks are either sets of pairs of integers $B_{k\tau} := \{(j,r): b'_k \le j <
b'_{k+1},\ b'_\tau \le r < b'_{\tau+1},\ j, r \in \mathcal{N}\}$ for the $L_2([0,1]^2)$ loss or sets of mixed pairs of
real and integer numbers $B_{k\tau} := \{(u,r): u \in [b'_k, b'_{k+1}),\ b'_\tau \le r < b'_{\tau+1},\ r \in \mathcal{N}\}$
for the other loss. The corresponding lengths/cardinalities of these blocks
are $L_k := b_{k+1} - b_k$ and $L_{k\tau} := (b'_{k+1} - b'_k)(b'_{\tau+1} - b'_\tau)$. In oracle inequalities,
we shall also use so-called adjusted lengths

(5.2) L kr i— j ^ — j 

T,(j, r )eB kT [\ Jo P~ 1 (x)(p 2r (x)dx\ + J Q \ J Q f '(y\x)(pj(y) dy\ 2 dx] 
or



kr 



(5.3) 



u,r) € B kT ) 



+ 



P 1 (x)ip2r(x) dx 



e^f{y\x)dy 



dx 



du 



for $L_2([0,1]^2)$ loss or the other loss, respectively, and let $L_k^* := L_k$.
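
The block boundaries $b'_s$ and thresholds $t_{k\tau}$ of the Notation paragraph are fully explicit and can be generated directly. In the sketch below the function names are illustrative assumptions, and the array is extended until its last element exceeds $n^{1/4}\ln\ln(n)$, which is the stopping rule used later in this section to define the cutoff $T$ of the bivariate component.

```python
import math

def bivariate_block_edges(n):
    """Return [b'_1, ..., b'_{T+1}], where T is the minimal integer
    with b'_{T+1} > n^{1/4} ln ln(n)."""
    lnln = math.log(math.log(n))
    b2 = 1 + math.floor(math.log(n) ** 0.75)
    edges = [0, b2]                      # b'_1 = 0 and b'_2
    limit = n ** 0.25 * lnln
    s = 2
    while edges[-1] <= limit:
        # b'_{s+1} = b'_s + floor(b'_2 (1 + 1/ln ln n)^(s-2))
        edges.append(edges[-1] + math.floor(b2 * (1 + 1 / lnln) ** (s - 2)))
        s += 1
    return edges

def threshold(k, tau):
    """t_{k tau} = 1 / ln ln((k + 3)(tau + 3))."""
    return 1.0 / math.log(math.log((k + 3) * (tau + 3)))

edges = bivariate_block_edges(10_000)    # T = len(edges) - 1 blocks in each direction
t11 = threshold(1, 1)
```

The block lengths grow slowly and geometrically, while the thresholds decrease slowly to zero as the block indices increase.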

We can now define EP estimators for the two losses in turn. 

EP estimator for $L_2([0,1]^2)$ loss. Here, the c.d. $f(y|x)$ is estimated over
the unit square $[0,1]^2$ and then [recalling the decomposition (5.1)]

(5.4)  $f(y) = \sum_{j=0}^{\infty} \theta_j \varphi_j(y), \quad \theta_j := \int_0^1 \Big[\int_0^1 f(y|x) \varphi_j(y)\,dy\Big]\,dx.$

The Fourier series (5.4) implies a familiar univariate EP estimator,

(5.5)  $\hat f(y) := \sum_{k=1}^{K} \hat\mu_k \sum_{j \in B_k} \hat\theta_j \varphi_j(y),$

where the $\hat\theta_j$ are empirical Fourier coefficients,

(5.6)  $\hat\theta_j := n^{-1} \sum_{l=1}^{n} I(Y_l \in [0,1])\, \varphi_j(Y_l)\, \tilde p^{-1}(X_l)$

and the $\hat\mu_k$ are plugged-in Wiener shrinkage coefficients,

(5.7)  $\hat\mu_k := \dfrac{\hat\Theta_k}{\hat\Theta_k + \hat d n^{-1}}\, I(\hat\Theta_k > t_k \hat d n^{-1}),$

where

(5.8)  $\hat\Theta_k := L_k^{-1} \sum_{j \in B_k} \hat\theta_j^2 - \hat d n^{-1}$

and

(5.9)  $\hat d := n^{-1} \sum_{l=1}^{n} I(Y_l \in [0,1])\, \tilde p^{-2}(X_l)$

estimates the coefficient of difficulty $d = \int_{[0,1]^2} f(y|x) p^{-1}(x)\,dy\,dx$. The
truncated-from-zero design density estimator is

(5.10)  $\tilde p(x) := \max(1/\ln\ln(n), \hat p(x)),$

the pivotal design density estimator $\hat p(x)$ is an orthogonal series estimator

(5.11)  $\hat p(x) := 1 + n^{-1} \sum_{r=1}^{n^{1/3}} \sum_{l=1}^{n} \varphi_r(X_l) \varphi_r(x),$

and the cutoff $K$ used in (5.5) is the minimal integer such that $b_{K+1} >
n^{1/3} \ln\ln(n)$.

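The underlying idea of EP estimation of the bivariate function $\psi(y,x)$ is
based on the expansion

(5.12)  $\psi(y,x) = \sum_{j=0}^{\infty} \sum_{r=1}^{\infty} \theta_{jr} \varphi_j(y) \varphi_r(x), \quad \theta_{jr} := \int_{[0,1]^2} f(y|x) \varphi_j(y) \varphi_r(x)\,dy\,dx.$

Note that the bivariate function (5.12) vanishes (as it should) if $f(y|x)$ does
not depend on $x$. The corresponding bivariate blockwise-shrinkage EP
estimator is then

(5.13)  $\hat\psi(y,x) := \sum_{k,\tau=1}^{T} \hat\mu_{k\tau} \sum_{(j,r) \in B_{k\tau}} \hat\theta_{jr} \varphi_j(y) \varphi_r(x),$

(5.14)  $\hat\theta_{jr} := n^{-1} \sum_{l=1}^{n} I(Y_l \in [0,1])\, \varphi_j(Y_l) \varphi_r(X_l)\, \tilde p^{-1}(X_l),$

(5.15)  $\hat\mu_{k\tau} := \dfrac{\hat\Theta_{k\tau}}{\hat\Theta_{k\tau} + \hat d n^{-1}}\, I(\hat\Theta_{k\tau} > t_{k\tau} \hat d n^{-1}),$

(5.16)  $\hat\Theta_{k\tau} := L_{k\tau}^{-1} \sum_{(j,r) \in B_{k\tau}} \hat\theta_{jr}^2 - \hat d n^{-1},$

where $\hat d$ is defined in (5.9), $\tilde p(x)$ in (5.10) and $T$ is the minimal integer such
that $b'_{T+1} > n^{1/4} \ln\ln(n)$. The EP estimator is then defined as
$\hat f(y|x) := \hat f(y) + \hat\psi(y,x)$.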
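
The steps (5.5)-(5.16) can be assembled into a short numerical sketch. It is only an illustration under stated assumptions and not the author's implementation: the arrays $\{b_k\}$ and $\{t_k\}$ are "given" in the text, so the sketch takes $b_k = b'_k$ and $t_k = 1/\ln\ln(k+3)$; the terms with $r = 0$ are dropped from the bivariate blocks to agree with the expansion (5.12); and the simulated additive model is hypothetical.

```python
import numpy as np

def phi(j, x):
    """Cosine basis on [0, 1]."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def block_edges(n, limit):
    """Edges b'_1, b'_2, ... extended until the last edge exceeds `limit`."""
    lnln = np.log(np.log(n))
    b2 = 1 + int(np.floor(np.log(n) ** 0.75))
    edges, s = [0, b2], 2
    while edges[-1] <= limit:
        edges.append(edges[-1] + int(np.floor(b2 * (1 + 1 / lnln) ** (s - 2))))
        s += 1
    return edges

def shrink(mean_sq, threshold, noise):
    """Plug-in Wiener coefficient as in (5.7) and (5.15)."""
    big_theta = mean_sq - noise
    return big_theta / (big_theta + noise) if big_theta > threshold * noise else 0.0

def ep_conditional_density(Y, X, y_grid, x_grid):
    n = len(Y)
    lnln = np.log(np.log(n))
    # (5.11), (5.10): series estimate of the design density, truncated away from zero.
    p_hat = 1.0 + sum(np.mean(phi(r, X)) * phi(r, X) for r in range(1, int(n ** (1 / 3)) + 1))
    w = ((Y >= 0) & (Y <= 1)) / np.maximum(1.0 / lnln, p_hat)  # I(Y_l in [0,1]) / p~(X_l)
    noise = np.mean(w ** 2) / n                                # d_hat / n, see (5.9)

    # Univariate component (5.5); ASSUMPTION: b_k = b'_k and t_k = 1 / ln ln(k + 3).
    eu = block_edges(n, n ** (1 / 3) * lnln)
    f_uni = np.zeros(len(y_grid))
    for k in range(1, len(eu)):
        js = range(eu[k - 1], eu[k])
        theta = np.array([np.mean(w * phi(j, Y)) for j in js])
        mu = shrink(np.mean(theta ** 2), 1 / np.log(np.log(k + 3)), noise)
        f_uni += mu * sum(t * phi(j, y_grid) for t, j in zip(theta, js))

    # Bivariate component (5.13); ASSUMPTION: the r = 0 terms are dropped, as in (5.12).
    eb = block_edges(n, n ** 0.25 * lnln)
    J = eb[-1]
    PY = np.array([phi(j, Y) for j in range(J)])
    PX = np.array([phi(r, X) for r in range(J)])
    theta2 = (PY * w) @ PX.T / n          # entry [j, r] is the empirical theta_jr of (5.14)
    theta2[:, 0] = 0.0
    coef = np.zeros_like(theta2)
    for k in range(1, len(eb)):
        for t in range(1, len(eb)):
            rows, cols = slice(eb[k - 1], eb[k]), slice(eb[t - 1], eb[t])
            block = theta2[rows, cols]
            mu = shrink(np.mean(block ** 2), 1 / np.log(np.log((k + 3) * (t + 3))), noise)
            coef[rows, cols] = mu * block
    PYg = np.array([phi(j, y_grid) for j in range(J)])
    PXg = np.array([phi(r, x_grid) for r in range(J)])
    return f_uni[:, None] + PYg.T @ coef @ PXg   # rows index y, columns index x

rng = np.random.default_rng(0)
X = rng.uniform(size=2000)
Y = 0.3 + 0.4 * X + 0.1 * rng.standard_normal(2000)   # hypothetical additive model
grid = (np.arange(50) + 0.5) / 50
estimate = ep_conditional_density(Y, X, grid, grid)
```

The returned array holds $\hat f(y|x)$ on the grid. Like any orthogonal series estimate it may take small negative values, and the sketch makes no attempt to correct that.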

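EP estimator for $L_2((-\infty,\infty) \times [0,1])$ loss. Here, in representation (5.1),
we have

(5.17)  $f(y) = (2\pi)^{-1} \int_{-\infty}^{\infty} h_0(u) e^{-iuy}\,du$

and

(5.18)  $\psi(y,x) = \sum_{r=1}^{\infty} (2\pi)^{-1} \int_{-\infty}^{\infty} h_r(u) e^{-iuy}\,du\, \varphi_r(x),$

where

(5.19)  $h_r(u) := \int_0^1 h(u|x) \varphi_r(x)\,dx, \quad r = 0, 1, \ldots,$

and

(5.20)  $h(u|x) := \int_{-\infty}^{\infty} f(y|x) e^{iyu}\,dy$

is the conditional characteristic function of $Y$ given $X = x$. This approach
implies the following EP estimation of $f(y)$:

(5.21)  $\hat f(y) := \pi^{-1} \int_0^{\infty} \mathrm{Re}\{\tilde h_0(u) e^{-iuy}\}\,du,$

where

(5.22)  $\tilde h_0(u) := \sum_{k=1}^{K} \hat\mu_k \hat h_0(u) I(u \in B_k), \quad u \ge 0,$

(5.23)  $\hat h_r(u) := n^{-1} \sum_{l=1}^{n} e^{iuY_l} \varphi_r(X_l)\, \tilde p^{-1}(X_l), \quad r = 0, 1, \ldots,$

(5.24)  $\hat\mu_k := \dfrac{\hat\Theta_k}{\hat\Theta_k + \hat d n^{-1}}\, I(\hat\Theta_k > t_k \hat d n^{-1}),$

(5.25)  $\hat\Theta_k := L_k^{-1} \int_{B_k} |\hat h_0(u)|^2\,du - \hat d n^{-1},$

and because for the considered loss the coefficient of difficulty simplifies to
$d = \int_0^1 p^{-1}(x)\,dx$, we use the estimate $\tilde p(x)$ defined in (5.10) and then set

(5.26)  $\hat d := \int_0^1 \tilde p^{-1}(x)\,dx.$

Further, the EP estimator of $\psi(y,x)$ is defined as

(5.27)  $\hat\psi(y,x) := \pi^{-1} \sum_{k,\tau=1}^{T} \hat\mu_{k\tau} \sum_{r=1}^{\infty} \int_0^{\infty} I((u,r) \in B_{k\tau})\, \mathrm{Re}\{\hat h_r(u) e^{-iuy}\}\,du\, \varphi_r(x),$

where

(5.28)  $\hat\mu_{k\tau} := \dfrac{\hat\Theta_{k\tau}}{\hat\Theta_{k\tau} + \hat d n^{-1}}\, I(\hat\Theta_{k\tau} > t_{k\tau} \hat d n^{-1})$

and

(5.29)  $\hat\Theta_{k\tau} := L_{k\tau}^{-1} \sum_{r=1}^{\infty} \int_0^{\infty} I((u,r) \in B_{k\tau})\, |\hat h_r(u)|^2\,du - \hat d n^{-1}.$

The EP c.d. estimator is then defined as

(5.30)  $\hat f(y|x) := \hat f(y) + \hat\psi(y,x).$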

Note that definition (5.30) of the EP c.d. estimator is the same for both 
losses, but the two additive components are different. This abuse of notation 
will allow us to consider the two losses simultaneously. 

6. Oracle inequality. The aim of this section is to show that the MISE 
of the EP estimator matches the MISE of an oracle that knows an estimated 
c.d. and has excellent statistical properties. As in the previous section, we 
are simultaneously considering two possible designs of predictors and two 
losses, $L_2([0,1]^2)$ and $L_2((-\infty,\infty)\times[0,1])$.

Let us introduce a blockwise-shrinkage oracle $f^*(y|x)$, motivated by Wiener's
filter, which serves as a benchmark for the EP c.d. estimator $\hat f(y|x)$. It is
defined as the estimator (5.30) with estimated shrinkage coefficients replaced
by coefficients depending on $(f,p)$,

(6.1) $\mu_k := \Theta_k/[\Theta_k + d(f,p)n^{-1}]$ and $\mu_{k\tau} := \Theta_{k\tau}/[\Theta_{k\tau} + d(f,p)n^{-1}],$

in place of the corresponding statistics $\hat\mu_k$ defined in (5.7) or (5.24) (for
the two losses) and $\hat\mu_{k\tau}$ defined in (5.15) or (5.28) (for the two losses),
respectively. Here and in what follows, $d(f,p)$ is defined in (3.5) for $L_2([0,1]^2)$
loss, $d(f,p) = d(p) = \int_0^1 p^{-1}(x)\,dx$ for $L_2((-\infty,\infty)\times[0,1])$ loss, and $\Theta_k$ and
$\Theta_{k\tau}$ are Sobolev functionals defined as

(6.2) $\Theta_k := L_k^{-1}\sum_{j\in B_k}\theta_j^2, \qquad \Theta_{k\tau} := L_{k\tau}^{-1}\sum_{(j,r)\in B_{k\tau}}\theta_{jr}^2$

for $L_2([0,1]^2)$ loss and

(6.3) $\Theta_k := L_k^{-1}\int_{B_k}|h_0(u)|^2\,du, \qquad \Theta_{k\tau} := L_{k\tau}^{-1}\sum_{r=1}^{\infty}\int_0^{\infty} I((u,r)\in B_{k\tau})|h_r(u)|^2\,du$

for the other loss.
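The form of the oracle coefficients in (6.1) can be motivated by a one-coefficient heuristic. The sketch below rests on a simplifying assumption made here purely for illustration: an unbiased estimate $\hat\theta$ of a single coefficient $\theta$ has variance $dn^{-1}$. Under that assumption, the risk of the shrunk estimate $\lambda\hat\theta$, its minimizer and the minimal risk are

```latex
\begin{align*}
E(\lambda\hat\theta - \theta)^2 &= (1-\lambda)^2\theta^2 + \lambda^2 d n^{-1},\\
\lambda^* &= \frac{\theta^2}{\theta^2 + d n^{-1}},\\
E(\lambda^*\hat\theta - \theta)^2 &= \frac{\theta^2\, d n^{-1}}{\theta^2 + d n^{-1}} = d n^{-1}\lambda^*.
\end{align*}
```

Replacing $\theta^2$ by a block average gives a weight of the same form as (6.1), and adding up $dn^{-1}\lambda^*$ over the coefficients of a block gives a term proportional to the block length times the block weight, which is one way to read the variance part of the oracle's risk.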

Theorem 6.1. The cases of bounded and unbounded responses, as well 
as the cases of the two studied losses, are considered simultaneously. Suppose 
that Assumptions 1 and 2 hold. Then the following oracle inequality holds 
for the EP estimator $\hat f(y|x)$:

(6.4) $E\int(\hat f(y|x) - f(y|x))^2\,dy\,dx \le E\int(f^*(y|x) - f(y|x))^2\,dy\,dx + \delta_n,$
where



8r, < Cn 



(6.5) 



E[^(4 /2 +^V /2 )+^V] 

k=\ 



+ E [W fc r(^V(LL)- 1/2 t fc - 3/2 ) + (^)" 2 i" 5 



A:t - 



k,r=l 

and the oracle's MISE satisfies



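(6.6) $E\int(f^*(y|x) - f(y|x))^2\,dy\,dx = n^{-1}c^*d(f,p)\Big[\sum_{k=1}^{K}L_k\mu_k + \sum_{k,\tau=1}^{T}L_{k\tau}\mu_{k\tau}\Big] + c^*\Big[\sum_{k>K}L_k\Theta_k + \sum_{k,\tau=1}^{\infty} I((k,\tau)\notin[1,T]^2)L_{k\tau}\Theta_{k\tau}\Big] + \delta_n^*,$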



where $c^* = 1$ for $L_2([0,1]^2)$ loss and $c^* = \pi^{-1}$ for $L_2((-\infty,\infty)\times[0,1])$ loss
and where, for any two arrays $\{\nu_k\in(0,1), k = 1, 2, \ldots\}$ and
$\{\nu_{k\tau}\in(0,1), k,\tau = 1, 2, \ldots\}$,



K 



\5* n \ < c-dif^n' 1 E £*A*fcfafe + CV fc 7V fe (L^ + n" 1 / 4 ) 



(6.7) 



fc=i 



r 



+ c*d(f,p)n 1 E L kT^krWkr + Cu k ^ii kT (L* kT ) 1 \. 



k,r=l 



The oracle inequality shows how well the EP estimator matches the oracle's 
risk. Note that it is a pointwise inequality (it is valid for a particular un- 
derlying c.d.) and it is exact (not asymptotic). The oracle inequality also 
allows us to establish minimax properties of the EP estimator, and this will 
be done in the next section. 



7. Minimax properties of the EP c.d. estimator. The oracle inequality 
of Theorem 6.1 allows us to establish a number of useful minimax results. 
We need an extra assumption on blocks and thresholds which is common in 
the literature; see the discussion in [3, 5, 6]. 

Assumption 3. Assume that $t_k \to 0$ and $L_{k+1}/L_k \to 1$ as $k \to \infty$ and
thatYX^Vo - 
Remark 7.1. To simultaneously consider the minimax approaches of
Sections 3 and 4, it is assumed that for the local-minimax approach of
Section 3, only an unknown additive component $g(y,x)$ of the c.d.
$f(y|x) = f_0(y|x) + g(y,x)$ is estimated by the bivariate EP estimator based on
empirical Fourier coefficients (5.6) and (5.14) minus corresponding Fourier
coefficients of the known pivotal c.d. $f_0(y|x)$. Note that a pivotal c.d. traditionally
studied in upper bounds is $f_0(y|x) = c < 1$, $(y,x)\in[0,1]^2$, for which there is
no difference between the local-minimax and minimax EP estimators, apart
from estimation of the single Fourier coefficient $\theta_0$.

Theorem 7.1. The cases of fixed and random designs are considered 
simultaneously. Let Assumptions 1-3 hold. Then for each loss, a corresponding
EP c.d. estimator, defined in Section 5, is simultaneously sharp minimax
over Sobolev, analytic-Sobolev and analytic classes of conditional densities
considered in Sections 3 and 4, that is, the MISE of the EP c.d. estimator
attains the lower bounds of Sections 3 and 4.

We are now in a position to show how well the bivariate EP estimator will
perform in the case of the classical hypothesis, "the response is independent
of the predictor." Under this hypothesis, suppose that an oracle $f^*(y)$
knows that the response is independent and then estimates the univariate
density $f(y) = f(y|x)$ based on $n$ i.i.d. responses $Y_1, Y_2, \ldots, Y_n$. Obviously,
this univariate oracle can be considered as a benchmark for any bivariate
c.d. estimator given that the hypothesis is true. Our aim is to compare the
bivariate EP c.d. estimator developed above with such an oracle.

We shall consider three classical classes of univariate densities. In what
follows, $\alpha$ is a positive integer and $\gamma$, $Q$ and $q$ are positive real
numbers. Let us begin with a class of differentiable univariate densities. For
$L_2((-\infty,\infty))$ loss (recall that now a univariate density is estimated), we
introduce a familiar Sobolev class
$S(\alpha,Q) := \{f(y): \int_{-\infty}^{\infty}(f^{(\alpha)}(y))^2\,dy \le Q\} = \{h(u): \pi^{-1}\int_0^{\infty}|u|^{2\alpha}|h(u)|^2\,du \le Q\}$,
where $f^{(\alpha)}$ is the $\alpha$th generalized derivative and
$h(u) := \int_{-\infty}^{\infty} f(y)e^{iuy}\,dy$ is the characteristic function; see [19, 32,
36]. With some obvious abuse of notation, for the case of $L_2([0,1])$ loss, we
define a similar Sobolev class
$S(\alpha,Q) := \{f(y): \sum_{j=1}^{\infty}(\pi j)^{2\alpha}\theta_j^2 \le Q;\ \theta_0 > c > 0,\ \theta_j := \int_0^1 f(y)\varphi_j(y)\,dy\}$.

Let us now consider analytic densities. For the case of L2((— 00, 00)) loss, 
a class of such densities was introduced in [20]: -4(7, Q) := {f(y) '■ vr -1 / °° e 7 " x 
\h(u)\ 2 du < Q;h(u) = f^e^ f(y)dy}. An L 2 ([0,1]) counterpart of this 

class is .4.(7, Q) := {/ : E£i e*™'0j < Q,0 > c > O;0,- = J* f(y)<pj(y) dy}; 
see [5]. 

Finally, a bounded spectrum class is defined as
$B(q) := \{f(y): h(u) = 0, |u| > q;\ h(u) := \int_{-\infty}^{\infty} e^{iuy}f(y)\,dy\}$ or
$B(q) := \{f(y): \theta_j = 0, j > q,\ \theta_0 > c_0 > 0;\ \theta_j := \int_0^1 f(y)\varphi_j(y)\,dy\}$
for $L_2((-\infty,\infty))$ or $L_2([0,1])$ loss, respectively. A
comprehensive discussion of this class can be found in [30]. Let us recall that
Ibragimov and Hasminskii [24, 28] have established that the minimax rate
of convergence for this class is parametric $qn^{-1}$.

Note that in all these definitions it is assumed that $f(y)$ is the density,
that is, that $f(y) \ge 0$ and $\int_{-\infty}^{\infty} f(y)\,dy = 1$.

The following univariate minimax result is well known; see [5, 8]. 

Proposition 7.1. Suppose that the response is independent of the
predictor. We are simultaneously considering the cases of $L_2([0,1])$ and
$L_2((-\infty,\infty))$ losses. There exists an oracle $f^*(y)$, based on $n$ i.i.d.
observations $Y_1, Y_2, \ldots, Y_n$ of the response, which is simultaneously rate
minimax for bounded spectrum densities and sharp minimax for Sobolev and
analytic densities. In particular, the EP univariate density estimator of [3],
with blocks and thresholds $\{(L_k,t_k)\}$ satisfying Assumption 3, may serve as
such an oracle and then

(7.1) $\sup_{f\in B(q)} [d(f)]^{-1} E\int(f^*(y) - f(y))^2\,dy \le Cqn^{-1},$

(7.2) $\sup_{f\in S(\alpha,Q)} [d(f)]^{-2\alpha/(2\alpha+1)} E\int(f^*(y) - f(y))^2\,dy = P(\alpha)Q^{1/(2\alpha+1)}n^{-2\alpha/(2\alpha+1)}(1 + o(1)),$

(7.3) $\sup_{f\in A(\gamma,Q)} [d(f)]^{-1} E\int(f^*(y) - f(y))^2\,dy = (\pi\gamma n/\ln(n))^{-1}(1 + o(1)),$

where $P(\alpha)$ is defined in (3.2), $d(f)$ is either $\int_0^1 f(y)\,dy$ or 1 and the
integrals in (7.1)-(7.3) are taken over $[0,1]$ or $(-\infty,\infty)$ for $L_2([0,1])$ loss or
$L_2((-\infty,\infty))$ loss, respectively.

We can now formulate a minimax assertion for the independent response 
case. 

Theorem 7.2. The cases of fixed and random designs as well as the 
cases of two studied losses are considered simultaneously. Suppose that the 
response is independent of the predictor, that is, $f(y|x) = f(y)$, $x\in[0,1]$,
and Assumptions 1-3 hold. Then the EP c.d. estimator $\hat f(y|x)$ of Section 5
is simultaneously rate minimax over bounded spectrum, analytic and Sobolev
classes of univariate densities, namely

(7.4) $\sup_{f\in B(q)} [d(f,p)]^{-1} E\int(\hat f(y|x) - f(y))^2\,dy\,dx \le Cqn^{-1},$

(7.5) $\sup_{f\in S(\alpha,Q)} [d(f,p)]^{-2\alpha/(2\alpha+1)} E\int(\hat f(y|x) - f(y))^2\,dy\,dx \le P(\alpha)Q^{1/(2\alpha+1)}n^{-2\alpha/(2\alpha+1)}(1 + o(1)),$

(7.6) $\sup_{f\in A(\gamma,Q)} [d(f,p)]^{-1} E\int(\hat f(y|x) - f(y))^2\,dy\,dx \le (\pi\gamma n/\ln(n))^{-1}(1 + o(1)),$

where $d(f,p) = \int_0^1 f(y)\,dy\int_0^1 p^{-1}(x)\,dx$ for the $L_2([0,1]^2)$ loss and
$d(f,p) = \int_0^1 p^{-1}(x)\,dx$ for the $L_2((-\infty,\infty)\times[0,1])$ loss.

This theorem implies the following sharp-minimax result. 

Corollary 7.1. Let the assumptions of Theorem 7.2 hold. Consider
the case of the uniform design density $p(x) = 1$, $x\in[0,1]$. Then the EP c.d.
estimator $\hat f(y|x)$ is simultaneously sharp minimax over Sobolev and analytic
univariate density classes. Further, the MISE of this bivariate estimator
matches the MISE of the univariate oracle $f^*(y)$ introduced in Proposition
7.1, namely,

(7.7) $E\int(\hat f(y|x) - f(y))^2\,dy\,dx = (1 + o(1))E\int(f^*(y) - f(y))^2\,dy + o(1)n^{-1}.$



To avoid any possible confusion, let us explain the integrals in (7.7). Under
$L_2([0,1]^2)$ loss, the left integral is taken over $[0,1]^2$, while the right one is
taken over $[0,1]$ and the additive components of the EP estimator
$\hat f(y|x) = \hat f(y) + \hat\psi(y,x)$ are defined in (5.5) and (5.13). Under
$L_2((-\infty,\infty)\times[0,1])$ loss, the left integral in (7.7) is taken over
$(-\infty,\infty)\times[0,1]$ with $y\in(-\infty,\infty)$ and $x\in[0,1]$, while the right
integral is taken over $(-\infty,\infty)$. Also, in this case, the additive components
of the EP estimator are defined in (5.21) and (5.27).



8. Optimal design of predictors for c.d. estimation. In a controlled ex- 
periment, the statistician can choose an underlying design density. Obtained 
results allow us to recommend a particular design density which minimizes 
the MISE of c.d. estimation. 

It is worthwhile to begin by recalling a known result for regression function
estimation. Consider a classical heteroscedastic regression
$Y = m(X) + \sigma(X)\varepsilon$ with the predictor $X$ being supported on $[0,1]$ and
the error $\varepsilon$ being standard normal. It is shown in [5, 13] that the MISE of
regression function estimation is minimized by a design density

(8.1) $p^*(x) := \sigma(x)\Big/\int_0^1\sigma(u)\,du.$

At the same time, according to previous sections, to minimize the MISE of
estimation of the c.d., the statistician needs to minimize the coefficient of
difficulty

$d(f,p) = \int_0^1\Big[\int_A f(y|x)\,dy\Big]p^{-1}(x)\,dx,$

where $A$ is either $[0,1]$ or $(-\infty,\infty)$, depending on the loss. A simple
calculation then shows that the optimal design density for c.d. estimation is

(8.2) $p^*_{c.d.}(x) := \Big[\int_A f(y|x)\,dy\Big]^{1/2}\Big/\int_0^1\Big[\int_A f(y|u)\,dy\Big]^{1/2}du.$



In general, optimal designs (8.1) and (8.2) are different, but there is one
important case where the two coincide. Consider a classical homoscedastic
regression [where the scale function $\sigma(x)$ is constant] and suppose that
$\int_A f(y|x)\,dy = 1$, $x\in[0,1]$ [note that the latter always holds for $A = (-\infty,\infty)$].
Then the uniform design is simultaneously optimal for the regression and
c.d. estimation problems. Furthermore, according to Corollary 7.1, if the de- 
sign is uniform, then the suggested bivariate EP estimator is sharp minimax 
under the hypothesis that the response is independent of the predictor. We 
can conclude that the uniform design has a very special place in controlled 
regression experiments. 
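The optimality of a design proportional to the square root of the inner integral can be checked numerically. Writing $g(x)$ for the integral of $f(y|x)$ over $A$, the Cauchy-Schwarz inequality gives $(\int_0^1 g^{1/2}(x)\,dx)^2 \le \int_0^1 g(x)p^{-1}(x)\,dx$ for every design density $p$, with equality when $p$ is proportional to $g^{1/2}$. The Python sketch below compares three designs for a hypothetical $g$; the particular $g$, the grid and the function names are assumptions chosen only for illustration.

```python
import numpy as np

def integrate(vals, x):
    """Trapezoidal rule on a uniform grid."""
    dx = x[1] - x[0]
    return float(dx * (vals.sum() - 0.5 * (vals[0] + vals[-1])))

def difficulty(g, p, x):
    """Coefficient of difficulty: integral of g(x) / p(x) over [0, 1]."""
    return integrate(g / p, x)

x = np.linspace(0.0, 1.0, 20001)
g = 0.2 + 0.6 * x                                # hypothetical inner integral, between 0.2 and 0.8
p_opt = np.sqrt(g) / integrate(np.sqrt(g), x)    # design proportional to sqrt(g)
p_unif = np.ones_like(x)                         # uniform design
p_lin = g / integrate(g, x)                      # design proportional to g itself

d_opt = difficulty(g, p_opt, x)
d_unif = difficulty(g, p_unif, x)
d_lin = difficulty(g, p_lin, x)
print(d_opt, d_unif, d_lin)
```

In this toy case the square-root design gives the smallest coefficient of difficulty among the three, and its value agrees with the squared integral of $g^{1/2}$; when $g$ is constant, the three designs coincide with the uniform one.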

Of course, in general, an underlying c.d. is unknown and cannot be used in 
designing an optimal experiment. A sequential design of predictors may then 
be a feasible option; the interested reader can find a discussion of sequential 
designs of predictors in [11]. 



9. Discussion. 



9.1. Effect of the design density. The obtained theoretical results show 
that if the design density satisfies Assumption 2 (which is a mild assumption 
with the main property for the design density being differentiability), then 
the rate of the MISE convergence is determined solely by the smoothness 
of the c.d. The design density may only affect the constant of the MISE 
convergence via the coefficient of difficulty. 
9.2. Classical example. Consider an additive regression $Y = m(X) + \varepsilon$,
where the regression error $\varepsilon$ is independent of the predictor and its density
is analytic (infinitely differentiable with the normal density, Cauchy density
and mixture of normal densities being the main examples). In this case,
the conditional density $f(y|x)$ is also analytic in $y$. Then, if the regression
function is just $\alpha$-fold differentiable and Assumption 2 holds, for a
corresponding analytic-Sobolev class the minimax MISE converges with the rate
$[n/\ln(n)]^{-2\alpha/(2\alpha+1)}$. This result is both good and bad news. The good news
is that up to a logarithmic penalty which is a minor factor for the curse 
of dimensionality, the bivariate c.d. can be estimated with the same MISE 
accuracy as the univariate regression function m(x). Furthermore, the de- 
sign density can be very rough (just differentiable), even in the case of an 
analytic error density. The bad news is that even if the error density is an- 
alytic and can then, according to [8], be estimated with the MISE of order 
ln(n)/n (i.e., with almost parametric accuracy), the MISE of c.d. estimation 
is primarily defined by the smoothness of the regression function and thus 
may be dramatically larger than the MISE of error density estimation. 
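The gap between the two rates quoted in this example can be made tangible with a few numbers. The short Python sketch below evaluates both orders for several sample sizes; constants are ignored, and the function names and the choice $\alpha = 2$ are assumptions made only for illustration.

```python
import math

def cd_rate(n, alpha):
    """Order [n / ln(n)]^(-2 alpha / (2 alpha + 1)) quoted for the c.d."""
    return (n / math.log(n)) ** (-2.0 * alpha / (2.0 * alpha + 1.0))

def error_density_rate(n):
    """Order ln(n) / n quoted for an analytic error density."""
    return math.log(n) / n

for n in (100, 1000, 10000):
    print(n, cd_rate(n, alpha=2), error_density_rate(n))
```

Because $\ln(n)/n$ is less than 1 for these sample sizes, raising it to a power below 1 makes it larger, so the c.d. order dominates the error-density order, and the ratio between the two grows with $n$.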

9.3. Fixed design. Fixed design regression is the classical setting in ap- 
plied regression analysis; see the discussion in [31]. For this setting, definition 
(1.1), which has been the key for the random design case, is not valid. For- 
tunately, this paper shows that a design affects neither lower bounds, nor 
upper bounds, nor the minimax data-driven EP estimation procedure, nor 
oracle inequalities. 

9.4. Dimension reduction. A traditional null hypothesis in regression 
analysis is that a response and a predictor are unrelated, that is, f(y\x) = 
f(y). In this case, the accuracy of estimation under an MISE criterion must 
be dramatically better because the estimated function is univariate and no 
curse of dimensionality occurs. It is established that the EP estimator pro- 
vides an optimal univariate accuracy of estimation when the null hypothesis 
is valid and thus solves the classical dimension reduction problem. 

9.5. C.d. estimation in other settings. The "regression" methodology 
thus far developed can be used in other classical settings, for instance in 
the popular time series one; see the discussion about this setting in [15]. 
The main complication here is that pairs of observations are no longer inde- 
pendent; at the same time, the setting is simpler because covariates cannot 
be deterministic. There are also many interesting expansions in the regres- 
sion setting considered. For instance, the predictor can be a vector and the 
covariates may be qualitative and quantitative; see the discussion of such a 
setting in [21]. Some new results for this setting can be found in [10]. 



[Figure 1: scattergram panels for the two datasets, n = 52 (left) and n = 183 (right).]

Fig. 1. Standard nonparametric regression analysis of two real datasets. The top-left dia- 
gram exhibits a scattergram of 52 observations showing a relationship between the amount 
of a detergent in a sludge and an index of centrifuging. The top-right diagram shows a 
scattergram of 183 observations showing a relationship between speed of rotation and an 
index of centrifuging. Scattergrams are overlaid by nonparametric regression estimates. 
Dotted lines in the bottom diagrams show the standard normal density. 

9.6. Minimax paradigm. This paper uses a classical minimax approach: 
an estimator must be minimax whenever Y and X are dependent (the c.d. 
is a bivariate function) and then, if Y and X are independent (the c.d. is 
a univariate function), it is desirable that the estimator be also minimax 
over univariate estimators/oracles. Note that the priorities are reversed in 
the dimension reduction literature. 

9.7. Small datasets. Let us begin by exploring two datasets collected by 
BIFAR, a company with interests in waste water treatment; the interested 
reader can find a complete account of these experiments in [9]. In what 
follows, freely available software from [5] is used; recall that it is based on 
mimicking EP estimators. Two columns of diagrams in Figure 1 exhibit a 
standard nonparametric regression analysis of two different datasets. The 







Fig. 2. Conditional density estimate, multiplied by a factor of 3, for the speed-index 
dataset exhibited in the top-right diagram in Figure 1; two views are shown. 

top-left diagram shows a scattergram with a pronounced regression func- 
tion. Diagrams below it indicate that the regression is homoscedastic with 
normal regression errors. This dataset is a textbook example, where the re- 
gression function allows one to quantify the impact of the predictor on the 
response. The dataset analyzed in the right column of Figure 1 is more com- 
plicated, due to the stepwise shape of the scale function and the multimodal 
marginal density of regression errors; thus, let us look at the c.d. estimate 
shown in Figure 2. The estimate exhibits pronounced ridges and large val- 
leys. There are several interesting features of the exhibited ridges: they are 
almost parallel to the speed-axis; they rise and then collapse over the speed 
range; ridges with larger speeds apparently have larger indices; the number 
of pronounced ridges increases from 1 to 3 over the range of speed. The 
interested reader can now return to Figure 1 and understand why the scale 
and error density estimates have those interesting shapes. 

Is it possible that estimates for the second dataset are just products of a 
"spurious" realization of a classical regression indicated in the first exam- 
ple? Let us check this by intensive Monte Carlo simulations based on the 
regression model for the first example and $n = 183$. Visual analysis of 500
c.d. estimates revealed that only 32 of those estimates exhibited more than 
one ridge, and in none of those cases did the error density estimate exhibit 
more than one mode. In other words, none of the Monte Carlo simulations 
revealed the pattern observed for the second experiment. Moreover, the au- 
thor analyzed two more experiments, identical to the second one, and they 
exhibited similar patterns for the error density and the c.d. These results 
show that a "spurious" nature of the estimates is unlikely. 

Let us also present results of an interesting Monte Carlo study conducted 
under the null hypothesis "the response and the predictor are independent." 
Suppose that Y and X are independent, Y is standard normal and X is 
standard uniform. In this study the bivariate EP c.d. estimator is compared 
with two univariate kernel oracles: (i) a super-oracle which knows that Y 
and $X$ are independent and that the estimated univariate c.d. $f(y|x) = f(y)$
is standard normal and which uses a Gaussian kernel with the optimal (for 
the underlying standard normal density) bandwidth (see [5], page 358); (ii) 
a sub-oracle which knows that Y and X are independent but does not know 
the density of Y and which uses a Gaussian kernel, but where choice of 
bandwidth is done by the S-PLUS function density. For each sample size,
500 simulations were conducted and the median ratio of the empirical ISE of
the nonparametric estimate to the empirical ISE of each oracle was calculated.
For sample sizes 50, 100, 150, 200 and 300, the corresponding medians were 
(the numerator presenting a median ratio for the super-oracle and the de- 
nominator for the sub-oracle) 6.6/0.83, 2.21/0.31, 1.95/0.37, 2.27/0.34 and 
2.62/0.52. As we see, the c.d. estimator cannot match the super-oracle which 
knows the underlying c.d., but it performs comparatively well when n > 100. 
At the same time, it outperforms the sub-oracle. 
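The oracle side of such a Monte Carlo study is easy to sketch. The Python code below simulates standard normal samples, applies a Gaussian-kernel estimate and reports the median empirical ISE. The bandwidth $(4/(3n))^{1/5}$ is the usual normal-reference choice and is used here as an assumption; it need not coincide with the bandwidth of [5], page 358. The function names, grid and number of simulations are likewise illustrative, and the EP estimator itself is not reproduced.

```python
import numpy as np

def gaussian_kde(y_grid, sample, h):
    """Gaussian-kernel density estimate evaluated on y_grid."""
    z = (y_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2.0 * np.pi))

def empirical_ise(estimate, truth, y_grid):
    """Riemann-sum approximation of the integrated squared error."""
    dy = y_grid[1] - y_grid[0]
    return float(np.sum((estimate - truth) ** 2) * dy)

def median_ise_normal_reference(n, n_sim=200, seed=0):
    """Median ISE of the kernel estimate over n_sim simulated samples."""
    rng = np.random.default_rng(seed)
    y_grid = np.linspace(-5.0, 5.0, 501)
    truth = np.exp(-0.5 * y_grid ** 2) / np.sqrt(2.0 * np.pi)
    h = (4.0 / (3.0 * n)) ** 0.2   # normal-reference bandwidth, an assumption here
    ises = [empirical_ise(gaussian_kde(y_grid, rng.standard_normal(n), h), truth, y_grid)
            for _ in range(n_sim)]
    return float(np.median(ises))
```

A ratio of the kind reported above would divide the ISE of the bivariate estimate, integrated over both arguments, by the oracle's ISE for the same simulated sample before taking the median over simulations.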

10. Proofs. 

Proof of Theorem 3.1. We begin by dividing the unit square $[0,1]^2$ into
$s^2$ subsquares, where the known densities $p(x)$ and $f_0(y|x)$ are approximated
by constants. Lower bounds are then established for each subsquare and the
total is evaluated; this is the plan of the proof. Also, whenever possible,
random and fixed designs will be considered simultaneously.

Set s = 1 + [ln(ln(n + 20))J and define H s = {f:f(y\x) = f (y\x) + 
[E^Lo/(fer)(y|^)-E^Lo/o/(fcr)(^l^)^]^((y,a ; ) G [0,l] 2 ),/ (fer) (y|z) EHskr, 
f(y\x) > 0}. The function classes Ti s kr are defined as follows. Let 4>{y) := 
(j>(n, y) be a sequence of flat-top nonnegative kernels defined on the real line 
such that for a given n, the kernel is zero beyond (0,1), it is my-fold con- 
tinuously differentiable on (— oo, oo), < 4>(y) < 1, 4>(y) = 1 for 2(ln(n))~ 2 < 
y < 1 — 2(ln(n)) -2 and its Ith derivative satisfies max y \4>^ l \y)\ < C(ln(n)) 2 ', 
/ = l,...,my. For instance, such a kernel may be constructed using the 
so-called modifiers, discussed in [5]. Let 4> s k(y) '■= 4>{ s y ~ k). Analogously 
define 4> sr (x), with mx replacing my. Set ip s kj(y) ■= s 1 / 2 Lpj(sy — k). For a 
(k, r)th subsquare, < k, r < s — 1, define <fi s kr(yi x ) '■= ^( s y — k)(f){sx — r), 

Pskrjt(y,x) := <Pskj(y)<Psrt(%)<j>skr(y,x), f[kr](y\ x ) '■= J2(j,t)eT(s,k,r) "skrjtVskrjtiy, x) 
and f(k r )(y\x) := f[kr](y\x)^ skr (y,x). The set T(s,k,r) of pairs is the 

difference between two sets defined as follows. Let rj n (Q) be defined by means 
of the relation Y^j,t>o([ a jt/ 7 ln(Q)] 1 ' 2 — a >jt)+ '■= nd~ 1 Q with d defined in 
(3.5), djt = 1 + (vrj) 2my + (7rt) 2mx and (x)+ = max(0, x). Then the larger set 
is {(j,t) : a jt < [r} n {Qskr)]~ 1/2 } with Q shr := Q(l - l/s)(J s _1 I s fe r ) _1 ; where 
hkr :=p(rs _1 )// (^s~ 1 |rs~ 1 ), IJ 1 = Efc^Lo( 1 /^fcr)- The smaller set con- 
sists of pairs (j,t) such that max(j, t) < ln s (n). We can now define Ti s kr '■= 
$\{f_{(kr)}(y|x): \sum_{(j,t)\in T(s,k,r)}[1 + (\pi j)^{2m_Y} + (\pi t)^{2m_X}]\nu_{skrjt}^2 \le Q_{skr},\ |f_{[kr]}(y|x)|^2 \le s^4\ln(n)R_n\}$,
where $R_n := R_n(f_0,p,S)$ is defined in (3.8).

Let us verify that for sufficiently large n, we have Ti s C 5(my,mx,Q, fo, Pn)- 
The definition of the flat-top kernel implies that f(y\x) — fo(y\x), (y,x) & 
[0, l] 2 , is ?ny-fold differentiable with respect to y and mx-fold differen- 
tiable with respect to x. Second, let us verify that for / € 7i s , the dif- 
ference f(y\x) — fo(y\x) belongs to 5(my,mx,Q). Set m = my and be- 
gin with the differentiation with respect to y; in several of the follow- 
ing lines we use the notation tp' l \y,x) := d l ip(y,x)/dy l . By the Leibniz 
rule, (f [kr] (y\x)cf> akr {y,x))^ =T,^ Crf^ l) (y\x)^l{y,x), where := 

m\/((m-l)U\). Note that for < I < m, we have {4>%{y , x)) 2 < C{s{ln(n)) 2 ) 21 
and for / (fcr ) € H skr , 



10.1 



J/Ll (y\ x )4>8kr(V,x)] dxdy 
l] 2 

r(k+i)/s / r (r+i)/s , \ 
<Cs 2l l^(n) ( [f^ l \y\x)fdx)dy 



k/s \Jr/s 

^-l),, 

skrjt 



< Cs 21 ln 4/ (n) Y, i 2(m ~ l 



(j,t)eT(s,k,r) 

•2(m-Z) 

<Cln 4m+1 (n) max ^ , Q srk = o(l) In' 2 (n)Q skr . 

In the last inequality we used the definition of Ji skr and the assumption 
that min(j, t : (j,t) G T s fc r ) > ln s (n). A similar conclusion can be arrived at 
for the derivatives with respect to x. Then, using Parseval's identity, we can 
write for f (kr) E H skr , 

[ff kr] (y\x) + (d m -f [kr] (y\x)/dy m n 2 

[o,i] 2 

(10.2) + (d^f [kr] (y\x)/dx^f}cl> 2 skr (y,x)dxdy 

< £ [1 + (™j) 2mY + (™t) 2mX ]"sk n t < Qskr. 
(j,t)eT(s,k,r) 

Using this, the fact that the function Y^ k ~r=l fkr(y\ x ) and its corresponding 
derivatives are zero at the boundary of [0, l] 2 , Proposition 1 of [7] and the 
fact that J2t~r=o Qskr = Q(l - s -1 ), we can conclude that Y,t~r=i f{kr){y\x) G 
S(mx, Tfiy , Q(l — s -1 )). We are left with the verification that a function 
9s(x) := J2 k ~r=i Jo f(k,r)(y\x) dy belongs to S(m x , m Y , o(l)s _1 ). Write for 
$f_{(kr)}\in\mathcal{H}_{skr}$,



9s{x)= / f[kr](y\x)4>skr(y,x)dy 

k,r=0 k / s 

8-1 n{k+l)/s 

2^ / f[kr](y\x)[l-(f)skr(y,x)}dy, 
i- _ r,Jkls 



k,r=0' 



where we use J^J^ f[kr](v\ x ) dy = 0. We then get 



f\gl{x) + {gt x \x)f]dx 
Jo 

< (l)ln^ 2 (n)+ / ^ / tf'feWll-^t^))^ 
JO ,±^ n Jk/s m 



.fc,r=0" 



2 

dx 



o(l)ln" 2 (n) 



This verifies that 7i s C S(rriY,mx,Q, fo, Pn) whenever p n vanishes slowly. 

Let us now establish a lower bound for / € 7i s and any estimate f n (y\x). 
Denote /(y|x) =: f (y\x) + /(y|x), 6 s (a;) := J2k~ r =o Iq 1 f(kr)(u\x) du = 

Y^fXoIo 1 f[kr]( x )(l - 4> skr (u,x))du and note that for f(y\x) G H s and any 
7>0, 

r(k+l)/s r (r+l)/s A 

(/(y|z) -/(y|z)) cfedy 



fc/s Jr/s 

/ (/(y^) - /(fer)(yN) +b s {x)) dxdy 

k/s Jr/s 

Ak+l)/s r(r+l)/s _ 2 

>(l-7)/ / (/(yk) - /[fcr](2/l X )) 

Jfc/s ./r/s 

/■(*+!)/* 2 

-7 / [/[fcr](y|z)(l -<£sfcr(y,aO) + 6 s (a0] 

r(k+l)/s r(r+l)/s _ 
>(l-7)/ / if{y\x) - f[ kr ](y,x)) dxdy 

J k/s Jr/s 

+ (l) 7 - 1 (ln( ? i))- 1 / 2 J R n . 
Then set 7 = s _1 and write 



sup E 1 



(/(y|z) - f{y\x)fdxdy \ 



feS(m Y ,m x ,Q,f ,p) U[0,1] 2 J 
> sup E\ / (f(y\x) - f(y\x)) 2 dxdy 
feHs U[o,i] 2 

s-1 , r ( k +l)/s r(r+l)/s A 

(10.3) = sup J2 E / / (/(yW-/(i/|aO) 

fen akir=0 Uk/s Jr/s 

s-1 

>(1-S _1 )E SU P E EiiVskrjt ~ Vskrjt) 2 } + o(l)R n 

k r=0 feH 3 kr (j s t)eT(s,k,r) 

s-1 

= :(l-s- 1 )^A fcr + o(l)i? n , 
fc=0 

where £ sfcr jt := /i/ s +1)/s J r ( /s +1)/S f(y\x)Pskrjt(y, x) dxdy. As we see, the origi- 
nal problem is converted into the problem of finding lower bounds for terms 
A kr corresponding to a subsquare; recall that our underlying idea has been 
to approximate the known conditional density fo(y\x) and the univariate 
density h{x) by constant functions on each subsquare. We continue with the 
following steps. First, we introduce an array of independent normal random 
variables Cskrjt with zero mean and variance (1 — ln)v 2 skr ^ v where the pos- 
itive sequence 7 n tends to zero as slowly as desired. We then introduce a 
stochastic process f*(y\x), defined as the f(y\x) S TL S previously studied, 
but with random Csfcrjt used in place of fixed and known v skr j t . The idea 
of considering such a stochastic process was suggested in [33], and following 
along the lines of the establishment of (A. 18) in that article, we obtain 

(10.4) $P((f^*(y|x) - f_0(y|x)) \in S(m_Y,m_X,Q)) = 1 + o(1).$
Now let us additionally suppose that 

^'skrjt — ^ * ^ ^ then easily verified 

that 

^2 SUp[v skr jt<Pskrjt(y, x)} 2 < Cs 3 R n . 
(j,t)eT(s,k,r) y ' x 

Further, we can introduce a similarly defined stochastic process f* kr y This, 
together with Theorem 6.2.3 in [29], implies the inequality 

p( sup \fr kr] (y\x)\ 2 <s 4 Hn)R r )>l-\o(l)\s- 2 . 
\(y,x)<E[0,l]2 J 

Our next step is to compute the classical parametric Fisher information
for $f\in\mathcal{H}_s$. Here, different calculations are needed for random and fixed
designs. Let us begin with the former one where observations are i.i.d. pairs
$(Y_l,X_l)$, $l = 1,\ldots,n$, and thus the Fisher information of $n$ pairs is $n$ times the
Fisher information of a single pair. For a parameter $\nu_{skrjt}$, the "individual"
Fisher information is

(10.5) $I_{skrjt} := E_{f_0,p}\{[\partial\ln(f(Y|X)p(X))/\partial\nu_{skrjt}]^2\}.$






Note that 

din f(y\x) 



krjt 



(10.6) 



a In 



s-1 



fo(y\x)+ f(kr)(y\~ 



k,r=0 



s i 

~ J2 f(kr)(z\x)dz 
i, ,,._n J 



k,r=0 



l((y,x)e[0,l] 5 



dv 



skrjt 



Vskrjt{y,x) ~ Jo <Pskrjt{y,z)dz 

f(y\x) 



I((y,x)e [0,1] 2 )- 



Recall that fo{y\x)p{x) is continuous on the unit square and write 

<Pskrjt(y, X) ~ Jp (fskrjtjz, x) dz 1 ' 



(10.7) I skrjt 
Further, 

/ 

J[o,i]- 



[0,1] : 



fo(y\x)p(x 



fo(y\x)p(x) 



f(y\x) 

<Pskj(y) ( Psrt(x)(f> skr (y,x) 1 ''' 



dx dy. 



f(y\x) 



dx dy 



(k+l)/s f(r+l)/s y skj (y)<plt(x) j , 

My\x)p(x) J -pr- r dx dy 

r/s f 2 {y,X) 



k/s 



+ 



(k+l)/s r(r+l)/ 



fc/s 



r/s 



/o(y|x)p(x) 



^kj(»Ri( I )[&( I 1 !')- 1 ] , , 
x , . dxdy 



(10.8) 



(k+l)/s r(r+l)/s 



k/s 



r/s 



P(y\x) 

[/ (A; S " 1 |r S - 1 )p(A ;S - 1 ) + O (l)] 

flkjiy^lrtix) 



+ o(l)ln~ 1 (n) 



/ 2 (fc S -i|r S -i)(l+ O (l)) 



dx dy 



fo{ks 1 \rs x ) 
= 7 afcr (l + o(l)). 

Note that here $o(1)\to 0$ as $n\to\infty$ uniformly over the considered $(k,j,t)$.
Also, for $j > 0$ and all sufficiently large $n$, we obtain

'Jo 1 <Pskj{z)Psrt(x)(l>skr{z,x) dz 1 2 



[0,1] 



fo(y\x)p(x) 



f(y\x) 



dx dy 
(10.9)



<c 



<c 



dx 



k/ 

i r /■! 



<Pskj(z)<Psrt(x) dz 







/ (1 - if skr (z,x))dz 

Jo 



dx 



dx<C\n 



n 



Combining the obtained results in (10.5), we get $I_{skrjt} = I_{skr}(1 + o(1))$ with
$o(1)\to 0$ as $n\to\infty$ uniformly over the considered $(k,r,j,t)$. Now let us
calculate Fisher information for the fixed design case. Here, observations are
pairs $(Y_l,X_l)$, $l = 1,\ldots,n$, where the predictors are deterministic and the
responses are independent but not identically distributed. Without loss of
generality, we can assume that $X_1\le X_2\le\cdots\le X_n$. Note that the Fisher
information of $n$ pairs is equal to the sum of the "individual" Fisher
information values. Let us calculate this "individual" information for a pair
$(Y_l,X_l)$ with respect to the parameter $\nu_{skrjt}$,



(10.10) 



Iskrjt 



Use of a calculation similar to (10.6)-(10.9) shows that
$I_{skrjt}(l) = f_0^{-1}(ks^{-1}|rs^{-1})\varphi_{srt}^2(X_l)(1 + o(1))$ if $X_l\in[rs^{-1},(r+1)s^{-1})$
and that it is zero otherwise. This yields



^ rt (X z )(l + o(l)) 



J2 I ^jt(l) = f 1 (ks 1 \rs x )s 1 rsrt 

1=1 {l:XiG[rs- 1 ,(r+l)s- 1 ),l<l<n} 

= fo 1 (k S - 1 \rs- 1 )s- 1 

{l:X l £[rs- 1 ,{r+l)s- 1 ),l<l<n} 

= n/ - 1 (^- 1 |rs- 1 )p(r S - 1 )(l + o(l)) = nl skr {l + o(l)). 

We can conclude that asymptotically the average Fisher information is the 
same for both designs. With this remark in mind, we can again continue our 
analysis of both cases simultaneously. 

We are now evaluating $\eta_n$ and $R^*_n$ as defined in (3.6)-(3.7). Set $\alpha := m_Y$,
$\beta := m_X$, $N := 1/\eta_n$ and rewrite (3.6) as

(10.11) $\sum_{\{(j,t):\,0<a_{jt}<N\}} [(a_{jt}N)^{1/2} - a_{jt}] = Qd^{-1}n.$

The sum in (10.11) can be approximated for large $N$ (or equivalently for
large $n$) by the integral

$G_N := \int_{\{(y,x):\,(\pi y)^{2\alpha}+(\pi x)^{2\beta}<N;\ y,x>0\}} \big([((\pi y)^{2\alpha}+(\pi x)^{2\beta})N]^{1/2} - [(\pi y)^{2\alpha}+(\pi x)^{2\beta}]\big)\,dx\,dy.$

Let us apply the change of variables $u = \pi yN^{-1/(2\alpha)}$ and $v = \pi xN^{-1/(2\beta)}$.
Then

$G_N = \pi^{-2}N^{1/(2\alpha)}N^{1/(2\beta)}N\int_{\{(u,v):\,u^{2\alpha}+v^{2\beta}<1;\ u,v>0\}} \big([u^{2\alpha}+v^{2\beta}]^{1/2} - [u^{2\alpha}+v^{2\beta}]\big)\,dv\,du.$

This yields $\eta_n = ([Q\pi^2J^{-1}(\alpha,\beta)][d^{-1}n])^{-2\tau/(2\tau+1)}(1 + o(1))$. To evaluate $R^*_n$,
we again approximate the sum in (3.7) by a corresponding integral and then
employ the change of variables described above,



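$G'_N := \int_{\{(y,x):\,(\pi y)^{2\alpha}+(\pi x)^{2\beta}<N;\ y,x>0\}} \big(1 - [(\pi y)^{2\alpha}+(\pi x)^{2\beta}]^{1/2}N^{-1/2}\big)\,dx\,dy = \pi^{-2}N^{1/(2\tau)}\int_{\{(u,v):\,u^{2\alpha}+v^{2\beta}<1;\ u,v>0\}} \big(1 - [u^{2\alpha}+v^{2\beta}]^{1/2}\big)\,dv\,du.$

This implies that $R^*_n = P(\alpha,\beta)Q^{1/(2\tau+1)}(d/n)^{2\tau/(2\tau+1)}(1 + o(1))$.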
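The replacement of the lattice sum in (10.11) by an integral can be checked numerically in a simple special case. The Python sketch below takes $\alpha = \beta = 1$, for which the integration region is a quarter disc and the approximating integral has the closed form $N^2/(24\pi)$ (a value worked out here for this special case, not quoted from the paper). The function names and the chosen $N$ are assumptions made for illustration.

```python
import numpy as np

def lattice_sum(N):
    """Sum of (a_jt N)^(1/2) - a_jt over lattice points with a_jt < N (alpha = beta = 1)."""
    m = int(np.sqrt(N) / np.pi) + 2
    j = np.arange(m)
    # a_jt = 1 + (pi j)^2 + (pi t)^2 on the grid of nonnegative integers
    a = 1.0 + (np.pi * j[:, None]) ** 2 + (np.pi * j[None, :]) ** 2
    inside = a < N
    return float(np.sum(np.sqrt(a[inside] * N) - a[inside]))

def integral_value(N):
    """Closed form of the approximating integral for alpha = beta = 1."""
    return N ** 2 / (24.0 * np.pi)

N = 1.0e6
ratio = lattice_sum(N) / integral_value(N)
print(ratio)
```

For large $N$ the ratio is close to 1, with a discrepancy that shrinks as $N$ grows. In this special case the integral scales as $N^2$, which matches the factor $N^{1/(2\alpha)}N^{1/(2\beta)}N$ obtained from the change of variables.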

We have established all propositions to proceed along the lines of the proof 
of Theorem 1 in [4]. This yields that uniformly over k,r € {0, 1, . . . , s — 1}, 

inf A kr > ( S - 4T Q*) 1/(2T+1) K r )- 2T/(2T+1) F(my,m x )(l + o(l)), 

(10.12) 

where the infimum is over all possible nonparametric estimates of $f$
considered in the theorem. Recalling the definition of $Q_{skr}$ and the fact that
$s = s(n)\to\infty$, $n\to\infty$, we get



8-1 



inf E A kr >P(m Y ,m x )Q 1 ^ 2T+1 ^n- 2T ^ 2T+1 h- 4 ^ 2 ^ 



(10.13) 



k,r=0 



E (is 1 Iskr) 
k,r=0 



l/(2r+l)j-2r/(2r+l) 



sfcr 



(l + o(l)). 



Further, 



s-l 



s-l 



E (^- 1 ^)- 1/(2r+1) ^ 2 ; /(2r+1) =(^ 1 )- 1/(2r+1) E I7kr 



k,r=0 



k,r=0 



(10.14) 



(IS 



•l\2r/(2r+l) 



E 

.fc,r=0 



/ (fcs ^rs 1 ) 
p(rs _1 ) 



2r/(2r+l) 
Using our assumption about continuity of $f_0(y|x)$ and $p(x)$ on the unit
square, we obtain



-4r/(2r+l) 



s-1 

E 

k,r=0 
8-1 

! E 

fc,r=0 
[0,1] 2 P(^) 



/o(fcs ^rs 1 ) 
^(rs" 1 ) 

/ (fcs -1 |r s-1 ) 



dx dy 



2r/(2r+l) 



2r/(2r+l) 



2r/(2r+l) 



We conclude that 

s-1 

inf ]T 4r>PK,mx)Q 1/(2r+1) 



fc,r=0 



77 



fo{y\x) 

[o,i] 2 p(a?) 



dxdy 



-i 2t/(2t+1) 



(1 + 0(1)). 



This, together with (10.3), verifies Theorem 3.1. 

Proofs of the lower bounds in Theorems 3.2 and 4.1 are similar. Proofs of 
the upper bounds can be found in the technical report [9]. 